1 Introduction

Penalization is a classical approach to model selection. In short, penalization chooses the model
minimizing the sum of the empirical risk (how well the model fits the data) and of some measure
of complexity of the model (called the penalty); see FPE [1], AIC [2], Mallows' Cp or CL [22]. A
huge amount of literature exists about penalties proportional to the dimension of the model in
regression, showing under various assumption sets that dimensionality-based penalties like Cp are
asymptotically optimal [26, ?, 21], and satisfy non-asymptotic oracle inequalities [12, ?, ?, ?].
Nevertheless, all these results assume that data are homoscedastic, that is, the noise level does not
depend on the position in the feature space, an assumption often questionable in practice.
Furthermore, Cp is empirically known to fail with heteroscedastic data, as shown for instance
by simulation studies in [6, 8].

In this paper, it is assumed that data can be heteroscedastic, but not necessarily with certainty.
Several estimators adapting to heteroscedasticity have been built thanks to model selection (see
[?] and references therein), but always assuming the model collection has a particular form. To
the best of our knowledge, only cross-validation or resampling-based procedures are built for
solving a general model selection problem when data are heteroscedastic. This fact was recently
confirmed, since resampling and V-fold penalties satisfy oracle inequalities for regressogram
selection when data are heteroscedastic [6, 5]. Nevertheless, adapting to heteroscedasticity with
resampling usually implies a significant increase of the computational complexity.

The main goal of the paper is to understand whether the additional computational cost of
resampling can be avoided, when, and at which price in terms of statistical performance. Let
us emphasize that determining from data only whether the noise level is constant is a difficult
question, since variations of the noise can easily be interpreted as variations of the smoothness
of the signal, and conversely. Therefore, the problem of choosing an appropriate penalty (in
particular, between dimensionality-based and resampling-based penalties) must be solved unless
homoscedasticity of data is not questionable at all. The answer clearly depends at least on what
is known about variations of the noise level, and on the computational power available.

The framework of the paper is least-squares regression with a random design, see Section 2.
We assume the goal of model selection is efficiency, that is, selecting a least-squares estimator 
with minimal quadratic risk, without assuming the regression function belongs to any of the 
models. Since we deal with a non-asymptotic framework, where the collection of models is 
allowed to grow with the sample size, a model selection procedure is said to be optimal (or 
efficient) when it satisfies an oracle inequality with leading constant (asymptotically) one. A 
classical approach to design optimal procedures is the unbiased risk estimation principle, recalled 
in Section 3.

The main results of the paper are stated in Section 4. First, all dimensionality-based penalties
are proved to be suboptimal (that is, the risk of the selected estimator is larger than the risk
of the oracle multiplied by a factor $C_1 > 1$) as soon as data are heteroscedastic, for selecting
among regressogram estimators (Theorem 2). Note that the restriction to regressograms is
merely technical, and we expect a similar result holds for general heteroscedastic model selection
problems. Compared to the oracle inequality satisfied by resampling-based penalties in the
same framework (Theorem 1, recalled in Section 3), Theorem 2 shows what is lost when using
dimensionality-based penalties with heteroscedastic data: at least a constant factor $C_1 > 1$.

Second, Proposition 2 shows that a well-calibrated penalty proportional to the dimension
of the models does not lose more than a constant factor $C_2 > C_1$ compared to the oracle.
Nevertheless, Cp strongly overfits for some heteroscedastic model selection problems, hence
losing a factor tending to infinity with the sample size compared to the oracle (Proposition 3).
Therefore, a proper calibration of dimensionality-based penalties is absolutely required when
heteroscedasticity is suspected.

These theoretical results are completed by a simulation experiment (Section 5), showing
a slightly more complex finite-sample behaviour. In particular, when the signal-to-noise ratio 
is rather small, improving a well-calibrated dimensionality-based penalty requires a significant 
increase of the computational complexity. 

Finally, from the results of Sections 4 and 5, Section 6 tries to answer the central question of
the paper: How to choose the penalty for a given model selection problem, taking into account
prior knowledge on the noise level and the computational power available?

All the proofs are given in Section 7.

2 Framework 

In this section, we describe the least-squares regression framework, model selection and the 
penalization approach. Then, typical examples of collections of models and heteroscedastic data 
are introduced. 

2.1 Least-squares regression 

Suppose we observe some data $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$, independent with common
distribution $P$, where the feature space $\mathcal{X}$ is typically a compact subset of $\mathbb{R}^k$. The goal is to predict
$Y$ given $X$, where $(X, Y) \sim P$ is a new data point independent of $(X_i, Y_i)_{1 \le i \le n}$. Denoting by $s$
the regression function, that is $s(x) = \mathbb{E}[Y \mid X = x]$, we can write

$$ Y_i = s(X_i) + \sigma(X_i) \varepsilon_i $$ (1)

where $\sigma : \mathcal{X} \to \mathbb{R}$ is the heteroscedastic noise level and $(\varepsilon_i)_{1 \le i \le n}$ are i.i.d. centered noise terms;
$\varepsilon_i$ may depend on $X_i$, but has mean 0 and variance 1 conditionally on $X_i$.

The quality of a predictor $t : \mathcal{X} \to \mathcal{Y}$ is measured by the quadratic prediction loss

$$ \mathbb{E}_{(X,Y) \sim P}[\gamma(t, (X, Y))] =: P\gamma(t) \quad \text{where} \quad \gamma(t, (x, y)) = (t(x) - y)^2 $$

is the least-squares contrast. The minimizer of $P\gamma(t)$ over the set of all predictors, called Bayes
predictor, is the regression function $s$. Therefore, the excess loss is defined as

$$ \ell(s, t) := P\gamma(t) - P\gamma(s) = \mathbb{E}_{(X,Y) \sim P}(t(X) - s(X))^2 . $$
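The last equality in the definition of the excess loss uses only that $s(X) = \mathbb{E}[Y \mid X]$. The following derivation is the standard argument, spelled out here as a reading aid; it is not quoted from the original text.

```latex
\begin{align*}
P\gamma(t)
  &= \mathbb{E}\big[(t(X) - s(X) + s(X) - Y)^2\big] \\
  &= \mathbb{E}\big[(t(X) - s(X))^2\big]
     + 2\,\mathbb{E}\big[(t(X) - s(X))\,(s(X) - Y)\big]
     + \mathbb{E}\big[(s(X) - Y)^2\big] \\
  &= \mathbb{E}\big[(t(X) - s(X))^2\big] + P\gamma(s) ,
\end{align*}
% The cross term vanishes by conditioning on X:
% E[(t(X) - s(X)) (s(X) - Y) | X] = (t(X) - s(X)) (s(X) - E[Y | X]) = 0.
```

In particular $P\gamma(t) \ge P\gamma(s)$ for every predictor $t$, which is why $s$ is the Bayes predictor.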

Given a particular set of predictors $S_m$ (called a model), the best predictor over $S_m$ is defined
by

$$ s_m := \arg\min_{t \in S_m} \{ P\gamma(t) \} . $$

The empirical counterpart of $s_m$ is the well-known empirical risk minimizer, defined by

$$ \widehat{s}_m := \arg\min_{t \in S_m} \{ P_n\gamma(t) \} $$

(when it exists and is unique), where $P_n = n^{-1} \sum_{i=1}^n \delta_{(X_i, Y_i)}$ is the empirical distribution;
$\widehat{s}_m$ is also called least-squares estimator since $\gamma$ is the least-squares contrast.

2.2 Model selection, penalization 

Let us assume that a family of models $(S_m)_{m \in \mathcal{M}_n}$ is given, hence a family of empirical risk
minimizers $(\widehat{s}_m)_{m \in \mathcal{M}_n}$. The model selection problem consists in looking for some data-dependent
$\widehat{m} \in \mathcal{M}_n$ such that $\ell(s, \widehat{s}_{\widehat{m}})$ is as small as possible. For instance, it would be convenient to
prove an oracle inequality of the form

$$ \ell(s, \widehat{s}_{\widehat{m}}) \le C \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \} + R_n $$ (2)

in expectation or with large probability, with leading constant $C$ close to 1 and $R_n = o(n^{-1})$.

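This paper focuses more precisely on model selection procedures by penalization, which can
be described as follows. Let $\operatorname{pen} : \mathcal{M}_n \to \mathbb{R}$ be some penalty function, possibly data-dependent,
and define

$$ \widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \{ \operatorname{crit}(m) \} \quad \text{with} \quad \operatorname{crit}(m) := P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m) . $$ (3)

The penalty $\operatorname{pen}(m)$ can usually be interpreted as a measure of the size of $S_m$. Since the ideal
criterion $\operatorname{crit}(m)$ is the true prediction error $P\gamma(\widehat{s}_m)$, the ideal penalty is

$$ \operatorname{pen}_{\mathrm{id}}(m) := P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m) . $$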

This quantity is unknown because it depends on the true distribution $P$. A natural idea is
to choose $\operatorname{pen}(m)$ as close as possible to $\operatorname{pen}_{\mathrm{id}}(m)$ for every $m \in \mathcal{M}_n$. This idea leads to the
well-known unbiased risk estimation principle, which is properly introduced in Section 3.1. For
instance, when each model $S_m$ is a finite-dimensional vector space of dimension $D_m$ and the
noise level is constant equal to $\sigma$, Mallows [22] proposed the Cp penalty defined by

$$ \operatorname{pen}_{C_p}(m) = \frac{2 \sigma^2 D_m}{n} . $$

Penalties proportional to $D_m$, like Cp, are extensively studied in Section 4.

Among the numerous families of models that can be used, this paper mostly considers "histogram
models", where for every $m \in \mathcal{M}_n$, $S_m$ is the set of piecewise constant functions w.r.t.
some fixed partition $\Lambda_m$ of $\mathcal{X}$. Note that the least-squares estimator $\widehat{s}_m$ on some histogram model
$S_m$ is also called a regressogram. Then, $S_m$ is a vector space of dimension $D_m = \operatorname{Card}(\Lambda_m)$
generated by the family $(\mathbb{1}_\lambda)_{\lambda \in \Lambda_m}$. Model selection among a family $(S_m)_{m \in \mathcal{M}_n}$ of histogram
models amounts to selecting a partition of $\mathcal{X}$ among $(\Lambda_m)_{m \in \mathcal{M}_n}$.
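As an illustration of what a regressogram computes, here is a minimal Python sketch, assuming the feature space is [0, 1) and the partition is described by a sorted list of bin edges. The function names and the convention for empty bins (value 0.0) are choices made for this example only; they do not come from the paper.

```python
from bisect import bisect_right

def bin_index(edges, x):
    """Index k of the bin [edges[k], edges[k+1]) containing x."""
    return min(bisect_right(edges, x) - 1, len(edges) - 2)

def fit_regressogram(edges, data):
    """Least-squares fit on a histogram model: the mean of Y in each bin.

    Empty bins get 0.0, an arbitrary convention for this sketch (the
    least-squares minimizer is not unique on an empty bin).
    """
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for x, y in data:
        k = bin_index(edges, x)
        sums[k] += y
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def empirical_risk(edges, means, data):
    """P_n gamma(t) for the piecewise constant predictor t given by (edges, means)."""
    return sum((means[bin_index(edges, x)] - y) ** 2 for x, y in data) / len(data)

# Toy sample: two points fall in [0, 1/2), one in [1/2, 1).
data = [(0.1, 1.0), (0.2, 3.0), (0.7, 10.0)]
edges = [0.0, 0.5, 1.0]
means = fit_regressogram(edges, data)
risk = empirical_risk(edges, means, data)
```

On each bin the fitted value is simply the average of the responses falling in that bin, which is what makes histogram models convenient for explicit computations.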

Three arguments motivate the choice of histogram models for this theoretical study. First,
better intuitions can be obtained on the role of variations of the noise level $\sigma(\cdot)$ over $\mathcal{X}$ (or
variations of the smoothness of $s$) because a histogram model is generated by a localized
basis $(\mathbb{1}_\lambda)_{\lambda \in \Lambda_m}$. Second, histograms have good approximation properties when the regression
function $s$ is $\alpha$-Hölderian with $\alpha \in (0, 1]$. Third, all important quantities for understanding the
model selection problem can be precisely controlled and compared.

2.3 Examples of histogram model collections 

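Let us assume in this section for simplicity that $\mathcal{X} = [0, 1)$. We define in this section several
collections of models $(S_m)_{m \in \mathcal{M}_n}$, always assuming that each $S_m$ is the histogram model associated
to some partition $\Lambda_m$ of $\mathcal{X}$.

The most natural (and simple) collection of histogram models is the collection of regular
histograms $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg})}}$, defined by

$$ \mathcal{M}_n^{(\mathrm{reg})} := \{ 1, \ldots, M_n \} \quad \text{and} \quad \forall m \in \mathcal{M}_n^{(\mathrm{reg})} , \quad \Lambda_m := \left( \left[ \frac{k-1}{m} , \frac{k}{m} \right) \right)_{1 \le k \le m} , $$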



where the maximal dimension $M_n \le n$ usually grows with $n$ slightly slower than $n$; reasonable
choices are $M_n = \lfloor n/2 \rfloor$, $M_n = \lfloor n/\ln(n) \rfloor$ or $M_n = \lfloor n/(\ln(n))^2 \rfloor$.

Model selection among the collection of regular histograms then amounts to selecting the
number $D_m \in \{ 1, \ldots, M_n \}$ of bins, or equivalently, selecting the bin size $1/D_m$ among $\{ 1/M_n, \ldots, 1/2, 1 \}$.
The regular collection $\mathcal{M}_n^{(\mathrm{reg})}$ is a good choice when the distribution of $X$ is close to the uniform
distribution on [0, 1], the noise level $\sigma(\cdot)$ is almost constant on [0, 1], and the variations of $s$
(measured by $|s'|$) are almost constant over $\mathcal{X}$.

Since we can seldom be sure these three assumptions are satisfied by real data, considering
other collections of histogram models can be useful in general, in particular for adapting to
possible heteroscedasticity of data, which is the main topic of the paper. The simplest case of
collection of histogram models with variable bin size is the collection of histograms with two bin
sizes and split at 1/2, $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$, defined by



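$$ \mathcal{M}_n^{(\mathrm{reg},1/2)} := \left\{ (D_1, D_2) \text{ s.t. } 1 \le D_1, D_2 \le \frac{M_n}{2} \right\} \cup \{ 1 \} , $$

where $S_1$ is the set of constant functions on $\mathcal{X}$ and for every $m = (D_{m,1}, D_{m,2}) \in (\mathbb{N} \setminus \{0\})^2$,

$$ \Lambda_m := \left( \left[ \frac{k-1}{2 D_{m,1}} , \frac{k}{2 D_{m,1}} \right) \right)_{1 \le k \le D_{m,1}} \cup \left( \left[ \frac{D_{m,2} + k - 1}{2 D_{m,2}} , \frac{D_{m,2} + k}{2 D_{m,2}} \right) \right)_{1 \le k \le D_{m,2}} . $$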
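The two-bin-size partitions can be made concrete with a short Python sketch. It assumes, as in this section, that the feature space is [0, 1) with a split at 1/2, and that both $D_1$ and $D_2$ are bounded by $M_n/2$ (the bound is partly illegible in the source, so this is an assumption). The names `partition_edges` and `model_collection` are hypothetical helpers.

```python
def partition_edges(d1, d2):
    """Bin edges of the partition with d1 regular bins on [0, 1/2) and d2 on [1/2, 1)."""
    left = [k / (2 * d1) for k in range(d1)]               # 0, 1/(2 d1), ..., (d1-1)/(2 d1)
    right = [(d2 + k) / (2 * d2) for k in range(d2 + 1)]   # 1/2, ..., 1
    return left + right

def model_collection(max_dim):
    """Map each model index to its bin edges.

    The key 1 is the model of constant functions; every other key is a pair
    (d1, d2) with 1 <= d1, d2 <= max_dim / 2 (assumed bound).
    """
    half = max_dim // 2
    models = {1: [0.0, 1.0]}
    for d1 in range(1, half + 1):
        for d2 in range(1, half + 1):
            models[(d1, d2)] = partition_edges(d1, d2)
    return models

# Example: two bins on the left half, one bin on the right half.
example = partition_edges(2, 1)
collection = model_collection(10)
```

A model $(d_1, d_2)$ has dimension $d_1 + d_2$, so several models share the same dimension as soon as the dimension is at least 3.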



Note that using a collection of models such as $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$ does not mean that data
are known to be heteroscedastic; $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$ can also be useful when one only suspects
that at least one quantity among $\sigma$, $|s'|$ and the density of $X$ w.r.t. the Lebesgue measure
(Leb) significantly varies over $\mathcal{X}$. Distinguishing between the three phenomena precisely is the
purpose of model selection: overfitting occurs when a large noise level is interpreted as large
variations of $s$ with low noise. The interest of using $\mathcal{M}_n^{(\mathrm{reg},1/2)}$ is illustrated by Figure 1, where
two data samples and the corresponding oracle estimators are plotted (left: heteroscedastic with
$s'$ constant; right: homoscedastic with $s'$ variable).

Figure 1: One data sample (red '+') and the corresponding oracle estimators (black '-') among
the family $(\widehat{s}_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$. Left: heteroscedastic data ($s(x) = x$; $\sigma(x) = 1$ if $x \le 1/2$ and
$\sigma(x) = 1/20$ otherwise) with sample size $n = 200$. Right: homoscedastic data ($s(x) = x/4$ if
$x \le 1/2$ and $s(x) = 1/8 + (2/3) \sin(16 \pi x)$ otherwise; $\sigma(x) = 1/2$) with sample size $n = 500$.

Using only two different bin sizes, with a fixed split at 1/2, is obviously not the only collection
of histograms that may be used. Let us mention here a few examples of alternative histogram
collections:

• the split can be put at any fixed position $t \in (0, 1)$ (possibly with different maximal numbers
of bins $M_{n,1}$ and $M_{n,2}$ on each side of $t$), leading to the collection $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},t)}}$.

• the position of the split can be variable:

$$ \mathcal{M}_n^{(\mathrm{reg,var})} := \bigcup_{t \in \mathcal{T}_n} \mathcal{M}_n^{(\mathrm{reg},t)} \quad \text{where } \mathcal{T}_n \subset (0, 1) , \text{ for instance } \mathcal{T}_n = \{ \ldots \text{ s.t. } 1 \le k \le \ldots - 1 \} . $$

• instead of a single split, one could consider collections with several splits (fixed or not),
such as {1/3, 2/3} or {1/4, 1/2, 3/4} for instance.

Remark that the cardinalities of all these collections are smaller than some power of $n$ (for
instance, $\operatorname{Card}(\mathcal{M}_n^{(\mathrm{reg},1/2)}) \le 1 + (M_n/2)^2 \le n^2$ as soon as $n \ge 2$). Therefore, as explained in
Section 3 below, penalization procedures using an estimator of $\operatorname{pen}_{\mathrm{id}}(m)$ for every $m \in \mathcal{M}_n$ as a
penalty are relevant. This paper
does not consider collections whose cardinalities grow faster than some power of $n$, such as the
ones used for multiple change-point detection. Indeed, the model selection problem is of different
nature for such collections, and requires the use of different penalties; see for instance [8] about
this particular problem.
Most results of the paper are proved for model selection among $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$, which
already captures most of the difficulty of model selection when data are heteroscedastic. The
simplicity of $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$ may be a drawback for analyzing real data; in the present
theoretical study, simplicity helps developing intuitions about the general case. Note that all results
of the paper can be proved similarly when $\mathcal{M}_n = \mathcal{M}_n^{(\mathrm{reg},t)}$ and $t \in (0, 1)$ is fixed; we conjecture
these results can be extended to $\mathcal{M}_n = \mathcal{M}_n^{(\mathrm{reg,var})}$, at the price of additional technicalities in
the proofs.



3 Unbiased risk estimation principle for heteroscedastic data 

The unbiased risk estimation principle is among the most classical approaches for model selection
[27]. Let us first summarize it in the general framework.

3.1 General framework 

Assume that for every $m \in \mathcal{M}_n$, $\operatorname{crit}(m, (X_i, Y_i)_{1 \le i \le n})$ estimates unbiasedly the risk $P\gamma(\widehat{s}_m)$
of the estimator $\widehat{s}_m$. Then, an oracle inequality like (2) with $C \approx 1$ should be satisfied by
any minimizer $\widehat{m}$ of $\operatorname{crit}(m, (X_i, Y_i)_{1 \le i \le n})$ over $m \in \mathcal{M}_n$. For instance, FPE [1], SURE [27] and
cross-validation [3, 28, 18] are model selection procedures built upon the unbiased risk estimation
principle.

When $\operatorname{crit}(m, (X_i, Y_i)_{1 \le i \le n})$ is a penalized empirical criterion given by (3), the unbiased risk
estimation principle can be rewritten as

$$ \forall m \in \mathcal{M}_n , \quad \operatorname{pen}(m) \approx \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)] = \mathbb{E}[(P - P_n)\gamma(\widehat{s}_m)] , $$

which is also known as Akaike's heuristics or Mallows' heuristics. For instance, AIC [2], Cp or
CL [22] (see Section 2.2), covariance penalties [17] and resampling penalties [15, 6] are penalties
built upon the unbiased risk estimation principle.

The unbiased risk estimation principle can lead to oracle inequalities with leading constant
$C = 1 + o(1)$ when $n$ tends to infinity, by proving that deviations of $P\gamma(\widehat{s}_m)$ around its
expectation are uniformly small with large probability. Such a result can be proved in various
frameworks as soon as the number of models grows at most polynomially with $n$, that is,
$\operatorname{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$ for some $c_{\mathcal{M}}, \alpha_{\mathcal{M}} \ge 0$; see for instance [13, ?] and references therein
for recent results in this direction in the regression framework.



3.2 Histogram models 



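Let $S_m$ be the histogram model associated with a partition $\Lambda_m$ of $\mathcal{X}$. Then, the concentration
inequalities of Section 7.3 show that for most models, the ideal penalty is close to its expectation.
Moreover, the expectation of the ideal penalty can be computed explicitly thanks to
Proposition 1, first proved in a previous paper [5]:

$$ \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} (2 + \delta_{n, p_\lambda}) \left( (\sigma_\lambda^r)^2 + (\sigma_\lambda^d)^2 \right) $$ (4)

where for every $\lambda \in \Lambda_m$,

$$ (\sigma_\lambda^r)^2 := \mathbb{E}\left[ (Y - s(X))^2 \mid X \in \lambda \right] = \mathbb{E}\left[ (\sigma(X))^2 \mid X \in \lambda \right] , \qquad (\sigma_\lambda^d)^2 := \mathbb{E}\left[ (s(X) - s_m(X))^2 \mid X \in \lambda \right] , $$

$p_\lambda := \mathbb{P}(X \in \lambda)$ and

$$ \forall n \in \mathbb{N} , \ \forall p \in (0, 1] , \quad |\delta_{n,p}| \le \min\left\{ L_1 , \frac{L_2}{(np)^{1/4}} \right\} $$

for some absolute constants $L_1, L_2 > 0$.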

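When data are homoscedastic, (4) shows that if $D_m$ and $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}$ tend to infinity and if
$\|s - s_m\|_\infty \to 0$,

$$ \forall m \in \mathcal{M}_n , \quad \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)] \approx \frac{2}{n} \sum_{\lambda \in \Lambda_m} \sigma^2 = \frac{2 \sigma^2 D_m}{n} = \operatorname{pen}_{C_p}(m) , $$

so that Cp should yield good model selection performances by the unbiased risk estimation
principle. When the constant noise level $\sigma$ is unknown, $\operatorname{pen}_{C_p}$ can still be used by replacing
$\sigma^2$ by some unbiased estimator of $\sigma^2$; see for instance [?] for a theoretical analysis of the
performance of Cp with some classical estimator of $\sigma^2$.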
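To make the Cp procedure concrete, the sketch below selects the number of bins of a regular regressogram on [0, 1) by minimizing the empirical risk plus $2\sigma^2 D/n$. It assumes the noise variance `sigma2` is known; the function names and the toy data set are illustrative choices, not taken from the paper.

```python
def regular_regressogram_risk(data, n_bins):
    """Empirical risk of the regressogram with n_bins regular bins on [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in data:
        k = min(int(x * n_bins), n_bins - 1)
        sums[k] += y
        counts[k] += 1
    # Empty bins get 0.0 (arbitrary convention; they carry no data point anyway).
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sum((means[min(int(x * n_bins), n_bins - 1)] - y) ** 2
               for x, y in data) / len(data)

def select_by_cp(data, max_dim, sigma2):
    """Return (selected number of bins, criterion values) for the Cp penalty."""
    n = len(data)
    crit = {d: regular_regressogram_risk(data, d) + 2.0 * sigma2 * d / n
            for d in range(1, max_dim + 1)}
    return min(crit, key=crit.get), crit

# Toy, noise-free step signal: s(x) = 0 on [0, 1/2) and 1 on [1/2, 1).
n = 40
data = [((i + 0.5) / n, 0.0 if (i + 0.5) / n < 0.5 else 1.0) for i in range(n)]
best, crit = select_by_cp(data, max_dim=10, sigma2=0.01)
```

On this toy sample every even number of bins fits the data exactly, so the penalty alone decides among them and the smallest such model is selected.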

On the contrary, when data are heteroscedastic, (4) shows that applying the unbiased risk
estimation principle requires taking into account the variations of $\sigma$ over $\mathcal{X}$. Without prior
information on $\sigma(\cdot)$, building a penalty for the general heteroscedastic framework is a challenging
problem, for which resampling methods have been successful.

3.3 Resampling-based penalization 

The resampling heuristics [15] provides a way of estimating the distribution of quantities of
the form $F(P, P_n)$, by building randomly from $P_n$ several "resamples" with empirical distribution
$P_n^W$. Then, the distribution of $F(P_n, P_n^W)$ conditionally on $P_n$ mimics the distribution of
$F(P, P_n)$. We refer to [6] for more details and references on the resampling heuristics in the
context of model selection. Since $\operatorname{pen}_{\mathrm{id}}(m) = F_m(P, P_n)$, the resampling heuristics can be used
for estimating $\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$ for every $m \in \mathcal{M}_n$. Depending on how resamples are built, we can
obtain different kinds of resampling-based penalties, in particular the following three.

First, bootstrap penalties [16] are obtained with the classical bootstrap resampling scheme,
where the resample is an n-sample i.i.d. with common distribution $P_n$. Second, general
exchangeable resampling schemes can be used for defining the family of (exchangeable) resampling
penalties [6, 20]. Third, V-fold penalties [5] are a computationally efficient alternative to
bootstrap and other exchangeable resampling penalties; they follow from the resampling heuristics
with a subsampling scheme inspired by V-fold cross-validation.

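Let us define here V-fold penalties, which are of particular interest because of their smaller
computational cost when $V$ is small. Let $V \in \{2, \ldots, n\}$ and $(B_j)_{1 \le j \le V}$ be a fixed partition of
$\{1, \ldots, n\}$ such that $\sup_j |\operatorname{Card}(B_j) - n/V| < 1$. For every $j$, define

$$ P_n^{(-j)} := \frac{1}{n - \operatorname{Card}(B_j)} \sum_{i \notin B_j} \delta_{(X_i, Y_i)} \quad \text{and} \quad \forall m \in \mathcal{M}_n , \quad \widehat{s}_m^{(-j)} \in \arg\min_{t \in S_m} \left\{ P_n^{(-j)} \gamma(t) \right\} . $$

Then, the V-fold penalty is defined by

$$ \operatorname{pen}_{\mathrm{VF}}(m) := \frac{V - 1}{V} \sum_{j=1}^{V} \left[ \left( P_n - P_n^{(-j)} \right) \gamma\left( \widehat{s}_m^{(-j)} \right) \right] . $$ (5)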
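The V-fold penalty of one histogram model can be sketched in a few lines of Python. The sketch assumes that the penalty has the form $\frac{V-1}{V} \sum_j (P_n - P_n^{(-j)}) \gamma(\widehat{s}_m^{(-j)})$, that the blocks are $B_j = \{i : i \bmod V = j\}$, and that empty training bins are predicted by 0.0; the last two are conventions of this example, not of the paper.

```python
from bisect import bisect_right

def bin_index(edges, x):
    return min(bisect_right(edges, x) - 1, len(edges) - 2)

def fit(edges, data):
    """Bin means of Y (0.0 on empty bins, an arbitrary convention)."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for x, y in data:
        k = bin_index(edges, x)
        sums[k] += y
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def risk(edges, means, data):
    """Average least-squares contrast of the fitted regressogram on data."""
    return sum((means[bin_index(edges, x)] - y) ** 2 for x, y in data) / len(data)

def vfold_penalty(edges, data, n_folds):
    """(V-1)/V * sum_j [P_n gamma(s_hat^(-j)) - P_n^(-j) gamma(s_hat^(-j))]."""
    total = 0.0
    for j in range(n_folds):
        # Training sample: every point outside the block B_j = {i : i % V == j}.
        train = [pt for i, pt in enumerate(data) if i % n_folds != j]
        means = fit(edges, train)
        total += risk(edges, means, data) - risk(edges, means, train)
    return (n_folds - 1) / n_folds * total

# Toy example with a single bin and V = 2.
toy = [(0.1, 0.0), (0.2, 2.0), (0.3, 0.0), (0.4, 2.0)]
pen = vfold_penalty([0.0, 1.0], toy, n_folds=2)
flat = vfold_penalty([0.0, 1.0], [(0.1, 1.0), (0.6, 1.0), (0.2, 1.0), (0.9, 1.0)], n_folds=2)
```

Each model requires V least-squares fits, which is the computational cost discussed later in this section.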

In the least-squares regression framework, exchangeable resampling and V-fold penalties
have been proved in [6, 5] to satisfy an oracle inequality of the form (2) with leading constant
$C = C(n) \to 1$ when $n \to \infty$. In order to state precisely one of these results, let us introduce a
set of assumptions, called (AS)hist.
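Assumption set (AS)hist. For every $m \in \mathcal{M}_n$, $S_m$ is the set of piecewise constant functions
on some fixed partition $\Lambda_m$ of $\mathcal{X}$, and $\mathcal{M}_n$ satisfies:

(P1) Polynomial complexity of $\mathcal{M}_n$: $\operatorname{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.

(P2) Richness of $\mathcal{M}_n$: $\exists m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} = \operatorname{Card}(\Lambda_{m_0}) \in [\sqrt{n}, c_{\mathrm{rich}} \sqrt{n}]$.

Moreover, data $(X_i, Y_i)_{1 \le i \le n}$ are i.i.d. and satisfy:

(Ab) Data are bounded: $\|Y_i\|_\infty \le A < \infty$.

(An) Uniform lower bound on the noise level: $\sigma(X_i) \ge \sigma_{\min} > 0$ a.s.

(Ap) The bias decreases like a power of $D_m$: constants $\beta_- \ge \beta_+ > 0$ and $C_+, C_- > 0$ exist such
that

$$ \forall m \in \mathcal{M}_n , \quad C_- D_m^{-\beta_-} \le \ell(s, s_m) \le C_+ D_m^{-\beta_+} . $$

(Ar$_\ell^X$) Lower regularity of the partitions for $\mathcal{L}(X)$: $D_m \min_{\lambda \in \Lambda_m} \{ \mathbb{P}(X \in \lambda) \} \ge c_{r,\ell}^X$.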

Remark 1. Assumption set (AS)hist is shown to be mild and discussed extensively in [?]; we
do not report such a discussion here because it is beyond the scope of the paper. In particular,
when $s$ is non-constant, $\alpha$-Hölderian for some $\alpha \in (0, 1]$ and $X$ has a lower bounded density
with respect to the Lebesgue measure on $\mathcal{X} = [0, 1]$, assumptions (P1), (P2), (Ap) and (Ar$_\ell^X$)
are satisfied by all the examples of model collections given in Section 2.3 (see in particular [5]
for a proof of the lower bound in (Ap) for regular partitions, which applies to the examples
of Section 2.3 since they are "piecewise regular"). Note also that all the results of the present
paper relying on (AS)hist also hold under various alternative assumption sets. For instance,
(Ab) and (An) can be relaxed, see [?] for details.

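Theorem 1 (Theorem 2 in [5]). Assume that (AS)hist holds true. Then, for every $V \ge 2$, a
constant $K_0(V)$ (depending only on $V$ and on the constants appearing in (AS)hist) and an event
of probability at least $1 - K_0(V) n^{-2}$ exist on which, for every

$$ \widehat{m}_{\mathrm{penVF}} \in \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\widehat{s}_m) + \operatorname{pen}_{\mathrm{VF}}(m) \} , $$

$$ \ell\left( s, \widehat{s}_{\widehat{m}_{\mathrm{penVF}}} \right) \le \left( 1 + (\ln n)^{-1/5} \right) \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \} . $$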

In particular, V-fold penalization is asymptotically optimal: when $n$ tends to infinity, the
excess loss of the estimator $\widehat{s}_{\widehat{m}_{\mathrm{penVF}}}$ is equivalent to the excess loss of the oracle estimator
$\widehat{s}_{m^\star}$, defined by $m^\star \in \arg\min_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \}$. A result similar to Theorem 1 has also been
proved for exchangeable resampling penalties in [6], under the same assumption set (AS)hist.
In particular, Theorem 1 is still valid when $V = n$. Let us emphasize that general unknown
variations of the noise level $\sigma(\cdot)$ are allowed in Theorem 1.

Theorem 1 (as well as its equivalent for exchangeable resampling penalties) mostly follows
from the unbiased risk estimation principle presented in Section 3.1. For every model $m \in
\mathcal{M}_n$, $\mathbb{E}[\operatorname{pen}_{\mathrm{VF}}(m)]$ is close to $\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$ whatever the variations of $\sigma(\cdot)$, and deviations of
$\operatorname{pen}_{\mathrm{VF}}(m)$ around its expectation can be properly controlled. The oracle inequality follows,
thanks to (P1).

The main drawback of exchangeable resampling penalties, and even V-fold penalties, is their
computational cost. Indeed, computing these penalties requires computing for every $m \in \mathcal{M}_n$
a least-squares estimator several times: $V$ times for V-fold penalties, at least $n$ times for
exchangeable resampling penalties. Therefore, except in particular problems for which $\widehat{s}_m$ can
be computed quickly, all resampling-based penalties can be intractable when $n$ is too large, except
maybe V-fold penalties with $V = 2$ or 3. Note that (V-fold) cross-validation methods suffer
from the same drawback, in addition to their bias which makes them suboptimal when $V$ is
small, see [5].

Furthermore, Theorem 1 could suggest that the performance of V-fold penalization does not
depend on $V$, so that the best choice always is $V = 2$, which minimizes the computational cost.
Although this asymptotically holds true at first order, quite a different picture holds when the
signal-to-noise ratio is small, according to the simulation studies of [5] and of Section 5 below.
Indeed, the amplitude of deviations of $\operatorname{pen}_{\mathrm{VF}}(m)$ around its expectation decreases with $V$, so
that the statistical performance of V-fold penalties can be much better for large $V$ than for
$V = 2$.

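Remark that one could also define hold-out penalties by

$$ \forall m \in \mathcal{M}_n , \quad \operatorname{pen}_{\mathrm{HO}}(m) := \frac{\operatorname{Card}(I)}{n - \operatorname{Card}(I)} \left( P_n - P_n^{(I)} \right) \gamma\left( \widehat{s}_m^{(I)} \right) \quad \text{where } I \subset \{1, \ldots, n\} \text{ is deterministic,} $$

$$ P_n^{(I)} := \frac{1}{\operatorname{Card}(I)} \sum_{i \in I} \delta_{(X_i, Y_i)} \quad \text{and} \quad \forall m \in \mathcal{M}_n , \quad \widehat{s}_m^{(I)} \in \arg\min_{t \in S_m} \left\{ P_n^{(I)} \gamma(t) \right\} , $$

which only requires computing $\widehat{s}_m^{(I)}$ once for each $m \in \mathcal{M}_n$. The proof of Theorem 1 can then be
extended to hold-out penalties provided that $\min\{ \operatorname{Card}(I), n - \operatorname{Card}(I) \}$ tends to infinity with
$n$ fast enough, for instance when $\operatorname{Card}(I) \approx n/2$. Nevertheless, hold-out penalties suffer from
a larger variability than 2-fold penalties, which leads to quite poor statistical performances.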

Therefore, when computational power is strongly limited and the signal-to-noise ratio is 
small, it may happen that none of the above resampling-based model selection procedures is 
satisfactory in terms of both computational cost and statistical performance. The purpose of 
the next two sections is to investigate whether the dimensionality of the models, which is freely 
available in general, can be used for building a computationally cheap model selection procedure 
with reasonably good statistical performance, in particular compared to V-fold penalties with
$V$ small.



4 Dimensionality-based model selection

Dimensionality as a vector space is the only information about the size of the models that is freely 
available in general. So, when some penalty must be proposed, functions of the dimensionality 
Dm of model Sm are the most natural (and classical) proposals. This section intends to measure 
the statistical performance of such procedures for least-squares regression with heteroscedastic 
data. 

4.1 Examples 

As previously mentioned in Section 2.2, Cp, defined by $\operatorname{pen}_{C_p}(m) = 2 \sigma^2 D_m / n$, is among the most
classical penalties for least-squares regression [22]. Cp belongs to the family of linear penalties,
that is, of the form

$$ K D_m , $$

where $K$ can either depend on prior information on $P$ (for instance, the value $\sigma$ of the
constant noise level) or on the sample only. A popular choice is $K = 2 \widehat{\sigma}^2 / n$, where $\widehat{\sigma}^2$ is an
estimator of the variance of the noise, see Section 6 of [10] for instance. Birgé and Massart [13]
recently proposed an alternative procedure for choosing $K$, based upon the "slope heuristics".

Refined versions of Cp have been proposed, for instance in [13, 11, 25], always assuming
homoscedasticity. Most of them are of the form

$$ \operatorname{pen}(m) = F(D_m) $$ (6)

where $F$ depends on $n$ and $\sigma^2$, or an estimator $\widehat{\sigma}^2$ of $\sigma^2$ when $\sigma^2$ is unknown. The rest of the
section focuses on dimensionality-based penalties, that is, penalties of the form (6).

4.2 Characterization of dimensionality-based penalties 

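Let us define, for every $D \in \mathcal{D}_n = \{ D_m \text{ s.t. } m \in \mathcal{M}_n \}$,

$$ \mathcal{M}_{\dim}(D) := \arg\min_{m \in \mathcal{M}_n \text{ s.t. } D_m = D} \{ P_n\gamma(\widehat{s}_m) \} \quad \text{and} \quad \mathcal{M}_{\dim} := \bigcup_{D \in \mathcal{D}_n} \mathcal{M}_{\dim}(D) . $$

The following lemma shows that any dimensionality-based penalization procedure actually
selects $\widehat{m} \in \mathcal{M}_{\dim}$.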

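Lemma 1. For every function $F : \mathcal{M}_n \to \mathbb{R}$ and any sample $(X_i, Y_i)_{1 \le i \le n}$,

$$ \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\widehat{s}_m) + F(D_m) \} \subset \mathcal{M}_{\dim} . $$

Proof of Lemma 1. Let $\widehat{m}_F \in \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\widehat{s}_m) + F(D_m) \}$. Then, whatever $m \in \mathcal{M}_n$,

$$ P_n\gamma(\widehat{s}_{\widehat{m}_F}) + F(D_{\widehat{m}_F}) \le P_n\gamma(\widehat{s}_m) + F(D_m) . $$ (7)

In particular, (7) holds for every $m \in \mathcal{M}_n$ such that $D_m = D_{\widehat{m}_F}$, for which $F(D_{\widehat{m}_F}) = F(D_m)$.
Therefore, (7) implies that $\widehat{m}_F \in \mathcal{M}_{\dim}(D_{\widehat{m}_F})$, hence $\widehat{m}_F \in \mathcal{M}_{\dim}$. □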

Lemma 1 shows that despite the variety of functions $F$ that can be used as a penalty, using a
function of the dimensionality as a penalty always implies selecting among $(S_m)_{m \in \mathcal{M}_{\dim}}$ (keeping
in mind that $\mathcal{M}_{\dim}$ is random). Indeed, penalizing with a function of $D$ means that all models
of a given dimension $D$ are penalized in the same way, so that the empirical risk alone is used for
selecting among models of the same dimension. By extension, we will call dimensionality-based
model selection procedure any procedure selecting a.s. $\widehat{m} \in \mathcal{M}_{\dim}$.
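Lemma 1 can be checked numerically on any table of empirical risks. The sketch below uses hypothetical risk values for a few models of the two-bin-size collection (key 1 for the constant model, pairs $(D_1, D_2)$ otherwise); the numbers are made up for illustration only.

```python
def dimension(m):
    """Dimension of a model: 1 for the constant model, d1 + d2 for a pair."""
    return 1 if m == 1 else sum(m)

def m_dim(emp_risk):
    """Map each dimension D to M_dim(D), the empirical-risk minimizers of dimension D."""
    best = {}
    for m, r in emp_risk.items():
        d = dimension(m)
        if d not in best or r < best[d][0]:
            best[d] = (r, {m})
        elif r == best[d][0]:
            best[d][1].add(m)
    return {d: models for d, (_, models) in best.items()}

def select(emp_risk, penalty):
    """Model minimizing empirical risk + penalty(dimension)."""
    return min(emp_risk, key=lambda m: emp_risk[m] + penalty(dimension(m)))

# Hypothetical empirical risks (illustrative values only).
emp_risk = {1: 0.9, (1, 1): 0.5, (1, 2): 0.3, (2, 1): 0.4,
            (2, 2): 0.2, (1, 3): 0.25, (3, 1): 0.1}
by_dim = m_dim(emp_risk)
union = set().union(*by_dim.values())
picks = [select(emp_risk, lambda d, k=k: k * d) for k in (0.0, 0.05, 0.2, 1.0)]
```

Whatever the function of the dimension used as a penalty, the selected model always lies in the union of the sets `by_dim[D]`, since the penalty is constant among models of equal dimension.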

Breiman [13] previously noticed that only a few models, called "RSS-extreme submodels",
can be selected by penalties of the form $F(D_m) = K D_m$ with $K > 0$. Although Breiman stated
this limitation can be beneficial from the computational point of view, results below show that
this limitation precisely makes the quadratic risk increase when data are heteroscedastic.

4.3 Pros and cons of dimensionality-based model selection 

As shown by equation (4), when data are heteroscedastic, $\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$ is no longer proportional
to the dimensionality $D_m$. The expectation of the ideal penalty actually is not even a function
of $D_m$ in general. Therefore, the unbiased risk estimation principle should prevent anyone from
using dimensionality-based model selection procedures.
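A small numerical example shows why the expected ideal penalty is not a function of the dimension. The sketch evaluates formula (4) under simplifying assumptions that are specific to this example: $X$ uniform on [0, 1), $\delta_{n,p} = 0$, the approximation term $(\sigma_\lambda^d)^2$ neglected, and a noise level equal to `sigma_left` on [0, 1/2) and `sigma_right` on [1/2, 1).

```python
def approx_expected_ideal_penalty(d1, d2, sigma_left, sigma_right, n):
    """(2/n) * sum over bins of the within-bin noise variance.

    With d1 bins on the left half and d2 on the right half, each left bin
    contributes sigma_left**2 and each right bin sigma_right**2.
    """
    return 2.0 * (d1 * sigma_left ** 2 + d2 * sigma_right ** 2) / n

# Two models of the same dimension 10, with n = 200, sigma = 1 on the left
# half and 1/20 on the right half.
pen_many_left = approx_expected_ideal_penalty(9, 1, 1.0, 0.05, 200)
pen_many_right = approx_expected_ideal_penalty(1, 9, 1.0, 0.05, 200)
```

Under these assumptions the two values differ by almost a factor nine although both models have dimension 10, so no function of $D_m$ alone can match both.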

Nevertheless, dimensionality-based model selection procedures are still used for analyzing 
heteroscedastic data for at least three reasons: 
• by ignorance of any other trustworthy model selection procedure than Cp, or of the
assumptions of Cp;

• because data are (wrongly) assumed to be homoscedastic;

• because they are simple and have a mild computational cost, no other measure of the size
of the models being available.

The last two points can indeed be good reasons, provided that we know what we can lose (in
terms of quadratic risk) by using a dimensionality-based model selection procedure instead of,
for instance, some resampling-based penalty. The purpose of the next subsections is to estimate
theoretically the price of violating the unbiased risk estimation principle in heteroscedastic
regression.

4.4 Suboptimality of dimensionality-based penalization 

Theorem 2 below shows that any dimensionality-based penalization procedure fails to attain
asymptotic optimality for model selection among $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$ when data are
heteroscedastic.

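Theorem 2. Assume that data $(X_1, Y_1), \ldots, (X_n, Y_n) \in [0, 1] \times \mathbb{R}$ are independent and identically
distributed (i.i.d.), $X_i$ has a uniform distribution over $\mathcal{X}$ and $\forall i = 1, \ldots, n$, $Y_i = s(X_i) + \sigma(X_i) \varepsilon_i$
where $(\varepsilon_i)_{1 \le i \le n}$ are independent such that $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ and $\mathbb{E}[\varepsilon_i^2 \mid X_i] = 1$. Assume
moreover that $s$ is twice continuously differentiable,

$$ \|\varepsilon_i\|_\infty \le E < \infty , \quad \min\left\{ (\sigma_a)^2 , (\sigma_b)^2 \right\} > 0 \quad \text{and} \quad (\sigma_a)^2 \ne (\sigma_b)^2 $$

$$ \text{where} \quad (\sigma_a)^2 := \int_0^{1/2} (\sigma(x))^2 \, dx \quad \text{and} \quad (\sigma_b)^2 := \int_{1/2}^{1} (\sigma(x))^2 \, dx . $$

Let $\mathcal{M}_n = \mathcal{M}_n^{(\mathrm{reg},1/2)}$ be the model collection defined in Section 2.3, with a maximal dimension
$M_n = \lfloor n/(\ln(n))^2 \rfloor$. Then, constants $K_1, C_1 > 0$ and an event of probability at least $1 - K_1 n^{-2}$
exist on which, for every function $F : \mathcal{M}_n \to \mathbb{R}$ and every $\widehat{m}_F \in \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\widehat{s}_m) + F(D_m) \}$,

$$ \ell(s, \widehat{s}_{\widehat{m}_F}) \ge C_1 \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \} \quad \text{with } C_1 > 1 . $$ (8)

The constant $C_1$ may only depend on $(\sigma_a)^2 / (\sigma_b)^2$; the constant $K_1$ may only depend on $E$,
$(\sigma_a)^2$, $(\sigma_b)^2$, $\|s'\|_\infty$ and $\|s''\|_\infty$.

Theorem 2 is proved in Section 7.4.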

Remark 2. 1. The right-hand side of (8) is of order $n^{-2/3}$. Hence, no oracle inequality (2)
for $\widehat{m}_F$ can be proved with a constant $C$ tending to one when $n$ tends to infinity and a
remainder term $R_n \ll n^{-2/3}$.

2. Results similar to Theorem 2 can be proved similarly with other model collections (such
as the nonregular ones defined in Section 2.3) and with unbounded noises (thanks to
concentration inequalities proved in [?]). The choice $\mathcal{M}_n = \mathcal{M}_n^{(\mathrm{reg},1/2)}$ in the statement of
Theorem 2 only intends to keep the proof as simple as possible.
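The order $n^{-2/3}$ in the first point of Remark 2 can be recovered by a standard heuristic bias-variance computation, sketched below. It assumes that, for a smooth non-constant $s$, a histogram model of dimension $D$ has a bias of order $D^{-2}$ and a variance of order $D/n$; the constants $a, b > 0$ are placeholders introduced for this sketch and do not appear in the paper.

```latex
\begin{align*}
\ell(s, \widehat{s}_m) &\approx \frac{a}{D_m^2} + \frac{b\, D_m}{n} , \\
\frac{d}{dD}\left( \frac{a}{D^2} + \frac{b D}{n} \right) = 0
  &\iff D^\star = \left( \frac{2 a n}{b} \right)^{1/3} , \\
\frac{a}{(D^\star)^2} + \frac{b D^\star}{n}
  &= \frac{3}{2}\, (2a)^{1/3}\, b^{2/3}\, n^{-2/3} .
\end{align*}
```

The best dimension is thus of order $n^{1/3}$ and the corresponding risk of order $n^{-2/3}$, so a remainder term negligible in front of $n^{-2/3}$ cannot compensate for a leading constant bounded away from one.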
Figure 2: Experiment X1-005. Left: Ideal penalty vs. $D_m$ for one particular sample. A similar
picture holds in expectation. Right: The path of models that can be selected with penalties
proportional to $\operatorname{pen}_{\mathrm{Loo}}$ (green squares; the small square corresponds to $\operatorname{pen} = \operatorname{pen}_{\mathrm{Loo}}$) is closer
to the oracle (red star) than the path of models that can be selected with a dimensionality-based
penalty (black circles; blue points correspond to linear penalties).

Theorem 2 is a quite strong result, implying that no dimensionality-based penalization
procedure can satisfy an oracle inequality with leading constant $1 < C < C_1$, even a procedure using
the knowledge of $s$ and $\sigma$! The proof of Theorem 2 even shows that the ideal dimensionality-based
model selection procedure, defined by

$$ \widehat{m}^\star_{\dim} := \widehat{m}(D^\star) \quad \text{where} \quad D^\star \in \arg\min_{D \in \mathcal{D}_n} \left\{ \ell\left( s, \widehat{s}_{\widehat{m}(D)} \right) \right\} , $$ (9)

is suboptimal with a large probability.

The combination of Theorems 1 and 2 shows that when data are heteroscedastic, the price
to pay for using a dimensionality-based model selection procedure instead of some
resampling-based penalty is (at least) an increase of the quadratic risk by some multiplying factor $C_1 > 1$
(except maybe for small sample sizes). Therefore, the computational cost of resampling has its
counterpart in the quadratic risk. Empirical evidence for the same phenomenon in the context
of multiple change-points detection can be found in [8].

4.5 Illustration of Theorem [2] 

Let us illustrate Theorem 2 and its proof with a simulation experiment, called 'X1-005': The model collection is $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$, and data $(X_i, Y_i)_{1 \leq i \leq n}$ are generated according to (1) with $s(x) = x$, $\sigma(x) = (1 + 19 \times \mathbb{1}_{x \leq 1/2})/20$, $X_i \sim \mathcal{U}([0,1])$, $\varepsilon_i \sim \mathcal{N}(0,1)$ and $n = 200$ data points. An example of such a data sample is plotted on the left of Figure 4 together with the oracle estimator.
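
The data-generating mechanism of experiment X1-005 can be sketched as follows. This is only an illustration: it assumes that model (1) reads $Y_i = s(X_i) + \sigma(X_i)\varepsilon_i$, and the function names `sigma_x1_005` and `sample_x1_005` are hypothetical, not taken from the paper.

```python
import random


def sigma_x1_005(x):
    """Noise level of experiment X1-005: 1 on [0, 1/2] and 1/20 on (1/2, 1]."""
    return (1.0 + 19.0 * (x <= 0.5)) / 20.0


def sample_x1_005(n=200, seed=0):
    """Draw n points (X_i, Y_i), assuming Y = s(X) + sigma(X) * eps with s(x) = x."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]  # X_i uniform on [0, 1)
    # eps_i is standard Gaussian, scaled by the local noise level
    ys = [x + sigma_x1_005(x) * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys
```

With these parameters the noise level is twenty times larger on the left half of the design space than on the right half, which is what makes the ideal penalty depend on where the bins are placed.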

Then, as remarked previously from Q, the ideal penalty is clearly not a function of the 
dimensionality (Figure 2 left). According to (1), the right penalty is not proportional to $D_m$ but to $D_{m,1} \int_0^{1/2} \sigma^2(x)\,dx + D_{m,2} \int_{1/2}^{1} \sigma^2(x)\,dx$. The consequence of this fact is that any $m \in \mathcal{M}_{\mathrm{dim}}$ is far from the oracle, as shown by the right of Figure 2. Indeed, minimizing the empirical risk over models of a given dimension $D$ leads to put more bins where the noise-level is larger, that is, to overfit locally (see also [8] for a deeper experimental study of this local overfitting

Figure 3: Experiment X1-005. Left: $\log_{10} \mathbb{P}(m = \widehat{m}^{\star}_{\mathrm{dim}})$ represented using $(D_{m,1}, D_{m,2})$ as coordinates, where $\widehat{m}^{\star}_{\mathrm{dim}}$ is defined by (9); $N = 10\,000$ samples have been simulated for estimating the probabilities. Right: $\log_{10} \mathbb{P}(m = m^{\star})$ using the same representation and the same $N = 10\,000$ samples.



phenomenon). Furthermore, Figure 3 shows that on 10 000 samples, $m^{\star}$ is almost always far from $\mathcal{M}_{\mathrm{dim}}$, and in particular from $\widehat{m}^{\star}_{\mathrm{dim}}$. On the contrary, using a resampling-based penalty (possibly multiplied by some factor $C_{\mathrm{ov}} > 0$) avoids overfitting and selects a model much closer to the oracle (see Figure 2 right).
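
To make the non-dimensional shape of the penalty concrete, the weighted sum $D_{m,1} \int_0^{1/2} \sigma^2 + D_{m,2} \int_{1/2}^{1} \sigma^2$ can be evaluated in closed form when the noise level is constant on each half of $[0,1]$, as in experiment X1-005. The sketch below is illustrative only; the function name `penalty_shape` and its default arguments are assumptions made for this example.

```python
def penalty_shape(d1, d2, sigma_left=1.0, sigma_right=0.05):
    """Shape D1 * int_0^{1/2} sigma^2 + D2 * int_{1/2}^1 sigma^2 for a noise
    level equal to sigma_left on [0, 1/2] and sigma_right on (1/2, 1]."""
    w_left = 0.5 * sigma_left ** 2    # integral of sigma^2 over [0, 1/2]
    w_right = 0.5 * sigma_right ** 2  # integral of sigma^2 over (1/2, 1]
    return d1 * w_left + d2 * w_right


# Two models with the same total dimension 20 but different bin allocations.
many_bins_left = penalty_shape(18, 2)   # about 9.0025
many_bins_right = penalty_shape(2, 18)  # about 1.0225
```

Both models have dimension 20, yet their penalty shapes differ by almost a factor of nine, so no function of the dimension alone can reproduce this quantity.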



4.6 Performance of linear penalties 

Let us now focus on the most classical dimensionality-based penalties, that is, "linear penalties" of the form

$$\mathrm{pen}(m) = K D_m \qquad (10)$$

where $K > 0$ can be data-dependent, but does not depend on $m$. The first result of this subsection is that linear penalties satisfy an oracle inequality (2) (with leading constant $C > 1$) provided the constant $K$ in (10) is large enough.

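Proposition 2. Assume that (AS)hist holds true. Then, if

$$\forall m \in \mathcal{M}_n , \quad \mathrm{pen}(m) = \frac{K D_m}{n} \quad \text{with} \quad K > \|\sigma\|_\infty^2 ,$$

constants $K_2, C_2 > 0$ exist such that with probability at least $1 - K_2 n^{-2}$,

$$\ell(s, \widehat{s}_{\widehat{m}}) \leq C_2 \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \} . \qquad (11)$$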

The constant K2 may depend on all the constants appearing in the assumptions {that is, cm, 
o:mj Crich, A, (Tmin; , , /3_, c^^ and K; assuming K > 2\\a\\1^, K2 does not depend 
on K). The constant C2 may only depend on K , cJmin and when K > 2 ||cr||^, C2 can be 

made as close as desired to Ka^^^ — 1 at the price of enlarging K2 ■ 

Proposition 2 is proved in Section 7.5. As a consequence of Proposition 2, if we can afford losing a constant factor of order $\|\sigma\|_\infty^2 / \sigma_{\min}^2$ in the quadratic risk, a relevant (and computationally cheap) strategy is the following: First, estimate an upper bound on $\|\sigma\|_\infty^2$. Second, plug it into the penalty $\mathrm{pen}(m) = K' \|\sigma\|_\infty^2 D_m / n$, where $K' > 1$ remains to be chosen. According to the proof of Proposition 2 and previous results in the homoscedastic case, $K' = 2$ should be a good choice in general. Let us now add a few comments.

Remark 3. 1. The assumption set (AS)hist is discussed in Remark [1] in Section [3^ and can 
be relaxed in various ways, see [B]. 

2. Proposition 2 can be generalized to other models than sets of piecewise constant functions. Indeed, the keystone of the proof of Proposition 2 is that $2\sigma_{\min}^2 \leq n D_m^{-1} \mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)] \leq 2\|\sigma\|_\infty^2$, which holds for instance when the design is fixed and the models $S_m$ are finite dimensional vector spaces. Therefore, using arguments similar to the ones of [12, 10] for instance, an oracle inequality like (11) can be proved when models are general vector spaces, assuming that the design $(X_j)_{1 \leq j \leq n}$ is deterministic.
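
The two-step strategy described after Proposition 2 (plug an upper bound on $\|\sigma\|_\infty^2$ into a linear penalty) can be sketched as follows. This is a minimal illustration under stated assumptions: the empirical risks $P_n\gamma(\widehat{s}_m)$ and the dimensions $D_m$ are supposed to be already computed, and the function name `select_linear_penalty` and its arguments are hypothetical.

```python
def select_linear_penalty(emp_risks, dims, n, sigma_sq_upper, k_prime=2.0):
    """Return the index of the model minimizing
    emp_risk + K' * sigma_sq_upper * D_m / n,
    where sigma_sq_upper is an upper bound on the squared sup-norm of sigma."""
    crit = [
        risk + k_prime * sigma_sq_upper * dim / n
        for risk, dim in zip(emp_risks, dims)
    ]
    # argmin over the model collection
    return min(range(len(crit)), key=crit.__getitem__)


# Toy usage: three models with decreasing empirical risk, increasing dimension.
risks, dims = [1.0, 0.5, 0.45], [1, 5, 20]
chosen_large_bound = select_linear_penalty(risks, dims, n=100, sigma_sq_upper=1.0)
chosen_small_bound = select_linear_penalty(risks, dims, n=100, sigma_sq_upper=0.01)
```

In this toy example the large bound selects the intermediate model while the small bound selects the largest one, which illustrates why the bound must not underestimate $\|\sigma\|_\infty^2$.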



The second result of this subsection shows that the condition $K > \|\sigma\|_\infty^2$ in Proposition 2 really prevents from strong overfitting. In particular, some example can be built where $C_p$ strongly overfits.

Proposition 3. Let us consider the framework of Section 2.1 with $\mathcal{X} = [0, 1]$, $\mathcal{M}_n = \mathcal{M}_n^{(\mathrm{reg},1/2)}$ with maximal dimension $M_n = \lfloor n/(\ln(n)) \rfloor$, and assume that:

• (Ab) and (An) hold true (see the definition of (AS)hist),

• $s \in \mathcal{H}(\alpha, R)$, that is, $\forall x_1, x_2 \in \mathcal{X}$, $|s(x_2) - s(x_1)| \leq R |x_2 - x_1|^{\alpha}$, for some $R > 0$ and $\alpha \in (0, 1]$,

• $\mu = \mathbb{P}(X \in [0, 1/2]) \in (0, 1)$,

• conditionally on $\{X \in [0, 1/2]\}$, $X$ has a density w.r.t. the Lebesgue measure which is lower bounded by $c_{X,\mathrm{Leb}} > 0$.

If in addition $\forall m \in \mathcal{M}_n$, $\mathrm{pen}(m) = K D_m / n$ with $K < \inf_{t \in [0,1/2]} \{ \sigma(t)^2 \}$, then constants
K3, K4, > exist such that with probability at least 1 — K^n~'^ , 

„ „ KaU , , 

ln(n) 

and ^(s,sa) > -— — 2 > ln(n) inf {£{s,Srn)} ■ (13) 
(ln(n)) m£Mn 

The constants K3, K4, may depend on A, amin, 01, R, fx, cx,Leb; iiif [0,1/2] {^^} ^'^'^ 
they do not depend on n. 

Proposition 3, which is actually a corollary of a more general result on minimal penalties (Theorem 2 in [9]), is proved in Section 7.6.

Consider in particular the following example: $X \in [0, 1]$ with a density w.r.t. Leb equal to $2\mu \mathbb{1}_{[0,1/2]} + 2(1-\mu) \mathbb{1}_{(1/2,1]}$ for some $\mu \in (0, 1)$ and $\sigma = \sigma_a \mathbb{1}_{[0,1/2]} + \sigma_b \mathbb{1}_{(1/2,1]}$ for some $\sigma_a > \sigma_b > 0$. Then, the penalty $K D_m / n$ leads to overfitting as soon as $K < \sigma_a^2 = \|\sigma\|_\infty^2$, which shows that the lower bound $K > \|\sigma\|_\infty^2$ appearing in the proof of Proposition 2 cannot be improved in this example.

Let us now consider $C_p$, that we naturally generalize to the heteroscedastic case by $\mathrm{pen}_{C_p}(m) = K D_m / n$ with $K = 2\mathbb{E}[\sigma(X)^2]$. In the above example, $K = 2\mu\sigma_a^2 + 2(1-\mu)\sigma_b^2$. So, the condition on $K$ in Proposition 3 can be written

$$\mu + (1 - \mu) \frac{\sigma_b^2}{\sigma_a^2} < \frac{1}{2} ,$$

Table 1: Parameters of the four experiments.

Experiment            X1-005                                        S0-1                            XS1-05                                    X1-005µ02
$s(x)$                $x$                                           $\sin(\pi x)$                   see Eq. (15)                              $x$
$\sigma(x)$           $(1 + 19 \times \mathbb{1}_{x \leq 1/2})/20$  $\mathbb{1}_{x > 1/2}$          $(1 + \mathbb{1}_{x \leq 1/2})/2$         $(1 + 19 \times \mathbb{1}_{x \leq 1/2})/20$
$n$                   200                                           200                             500                                       1000
$P(X \leq 1/2)$       1/2                                           1/2                             1/2                                       1/5
$M_n$                 $\lfloor n/\ln(n) \rfloor = 37$               $\lfloor n/\ln(n) \rfloor = 37$ $\lfloor n/\ln(n) \rfloor = 80$           $\lfloor n/(\ln(n))^2 \rfloor = 20$


which holds when

$$\mu \in (0, 1/2) \quad \text{and} \quad \frac{\sigma_b^2}{\sigma_a^2} < \frac{1 - 2\mu}{2(1 - \mu)} . \qquad (14)$$

Therefore, when (14) holds, Proposition 3 shows that $C_p$ strongly overfits.
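
The condition on $K$ in the two-level example can be checked numerically. The sketch below only tests the hypothesis on $K$ of Proposition 3 (the other assumptions of the proposition are not verified), and the function names are hypothetical. It uses the parameters of Table 1 for experiments X1-005 and X1-005µ02, with $\sigma_a = 1$ on $[0, 1/2]$ and $\sigma_b = 1/20$ on $(1/2, 1]$.

```python
def cp_constant(mu, sigma_a, sigma_b):
    """K = 2 E[sigma(X)^2] when sigma = sigma_a on [0, 1/2] (probability mu)
    and sigma = sigma_b on (1/2, 1] (probability 1 - mu)."""
    return 2.0 * (mu * sigma_a ** 2 + (1.0 - mu) * sigma_b ** 2)


def cp_violates_lower_bound(mu, sigma_a, sigma_b):
    """True when K < sigma_a^2, i.e. the condition on K in Proposition 3."""
    return cp_constant(mu, sigma_a, sigma_b) < sigma_a ** 2


def condition_14(mu, sigma_a, sigma_b):
    """Equivalent form (14): mu in (0, 1/2) and a bound on sigma_b^2 / sigma_a^2."""
    ratio = (sigma_b / sigma_a) ** 2
    return 0.0 < mu < 0.5 and ratio < (1.0 - 2.0 * mu) / (2.0 * (1.0 - mu))


# mu = 1/2 (X1-005): K = 1.0025 > 1, so the condition fails.
# mu = 1/5 (X1-005µ02): K = 0.404 < 1, so the condition holds.
for mu in (0.5, 0.2):
    assert cp_violates_lower_bound(mu, 1.0, 0.05) == condition_14(mu, 1.0, 0.05)
```

Under these parameters, only the unbalanced design ($\mu = 1/5$) falls into the regime where Proposition 3 applies to $C_p$.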

The conclusion of this subsection is that some linear penalties can be used with heteroscedastic data provided that we can estimate $\|\sigma\|_\infty^2$ by some $\widehat{\sigma}^2_\infty$ such that $2\widehat{\sigma}^2_\infty > \|\sigma\|_\infty^2$ holds with probability close to 1. Then the price to pay is an increase of the quadratic risk by a constant factor of order $\max(\sigma^2)/\min(\sigma^2)$ in general.



5 Simulation study 

This section intends to compare by a simulation study the finite sample performances of the 
model selection procedures studied in the previous sections: dimensionality-based and resampling- 
based procedures. 



5.1 Experiments 

We consider four experiments, called 'X1-005' (as in Section 4.5), 'X1-005µ02', 'S0-1' and 'XS1-05'. Data $(X_i, Y_i)_{1 \leq i \leq n}$ are generated according to (1) with $\varepsilon_i \sim \mathcal{N}(0, 1)$ and $X_i$ has density w.r.t. Leb([0,1]) of the form $2\mu \mathbb{1}_{[0,1/2]} + 2(1 - \mu) \mathbb{1}_{(1/2,1]}$, where $\mu = \mathbb{P}(X_i \leq 1/2) \in (0, 1)$. The functions $s$ and $\sigma$, and the values of $n$ and $\mu$, depend on the experiment, see Table 1 and Figure 4. In experiment 'XS1-05', the regression function is given by



$$s(x) = \frac{x}{4}\, \mathbb{1}_{x \leq 1/2} + \left( \frac{1}{8} + \frac{2}{3} \sin(16\pi x) \right) \mathbb{1}_{x > 1/2} . \qquad (15)$$



In each experiment, $N = 10\,000$ independent data samples are generated, and the model collection is $(S_m)_{m \in \mathcal{M}_n^{(\mathrm{reg},1/2)}}$, with different values of $M_n$ for computational reasons, see Table 1. The signal-to-noise ratio is rather small in the four experimental settings considered here, and the collection of models is quite large ($\mathrm{Card}(\mathcal{M}_n) = 1 + (M_n/2)^2$). Therefore, we can expect overpenalization to be necessary (see Section 6.3.2 of [6] for more details on overpenalization).



5.2 Procedures compared 

For each sample, the following model selection procedures are compared, where $\tau$ denotes a permutation of $\{1, \ldots, n\}$ such that $(X_{\tau(i)})_{1 \leq i \leq n}$ is nondecreasing.

Figure 4: Regression functions and one particular data sample for the four experiments. Panels recovered from the figure: Experiments X1-005 (left) and X1-005µ02 (right), one data sample each; Experiment XS1-05 (left: regression function; right: one data sample). The horizontal axis is $x$ in all panels.

(A) Epenid: penalization with

$$\mathrm{pen}(m) = \mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)] \quad \text{where} \quad \mathrm{pen}_{\mathrm{id}}(m) = P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m) ,$$

as defined in Section 2.2. This procedure makes use of the knowledge of the true distribution $P$ of data. Its model selection performances witness what performances could be expected (ideally) from penalization procedures adapting to heteroscedasticity.

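(B) MalEst: penalization with $\mathrm{pen}_{C_p}(m)$ where the variance is estimated as in Section 6 of [10], that is,

$$\mathrm{pen}(m) = \frac{2 \widehat{\sigma}^2 D_m}{n} \quad \text{with} \quad \widehat{\sigma}^2 := \frac{1}{n} \sum_{i=1}^{n/2} \left( Y_{\tau(2i)} - Y_{\tau(2i-1)} \right)^2 .$$

Replacing $\widehat{\sigma}^2$ by $\mathbb{E}[\sigma(X)^2]$ doesn't change much the performances of MalEst, see Appendix A.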



(C) MalMax: penalization with $\mathrm{pen}(m) = 2 \|\sigma\|_\infty^2 D_m / n$ (using the knowledge of $\sigma$).

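(D) HO: hold-out procedure, that is,

$$\widehat{m} \in \operatorname{argmin}_{m \in \mathcal{M}_n} \left\{ P_n^{(I^c)} \gamma\left( \widehat{s}_m^{(I)} \right) \right\} ,$$

where $P_n^{(I^c)}$ and $\widehat{s}_m^{(I)}$ are defined as in Section 3.3 and $I \subset \{1, \ldots, n\}$ is uniformly chosen among subsets of size $n/2$ such that $\forall k \in \{1, \ldots, n/2\}$, $\mathrm{Card}(I \cap \{\tau(2k-1), \tau(2k)\}) = 1$.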

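(E-F-G) VFCV ($V$-fold cross-validation) with $V = 2$, 5 and 10:

$$\widehat{m} \in \operatorname{argmin}_{m \in \mathcal{M}_n} \left\{ \frac{1}{V} \sum_{j=1}^{V} P_n^{(B_j)} \gamma\left( \widehat{s}_m^{(B_j^c)} \right) \right\} ,$$

where $(B_j)_{1 \leq j \leq V}$ is a regular partition of $\{1, \ldots, n\}$, uniformly chosen among partitions such that $\forall j \in \{1, \ldots, V\}$, $\forall k \in \{1, \ldots, n/V\}$, $\mathrm{Card}(B_j \cap \{\tau(i) \text{ s.t. } kV - V + 1 \leq i \leq kV\}) = 1$.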

(H) penHO: hold-out penalization, as defined in Section 3.3, with the same training set $I$ as in procedure (D).

(I-J-K) penVF ($V$-fold penalization) with $V = 2$, 5 and 10, as defined by (5), with the same partition $(B_j)_{1 \leq j \leq V}$ as in procedures (E-F-G) respectively.

(L) penLoo (leave-one-out penalization), that is, $V$-fold penalization with $V = n$ and $\forall j$, $B_j = \{j\}$.

Every penalization procedure was also performed with various overpenalization factors $C_{\mathrm{ov}} \geq 1$, that is, with $\mathrm{pen}(m)$ replaced by $C_{\mathrm{ov}} \times \mathrm{pen}(m)$. Only results with $C_{\mathrm{ov}} \in \{1, 2, 4\}$ are reported in the paper since they summarize well the whole picture.
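
The constraint imposed on the $V$-fold partition (each block contains exactly one point from every group of $V$ consecutive design points in sorted order) can be sketched as follows. This is one natural way to draw such a partition, not necessarily the implementation used for the experiments; the function name `stratified_vfold_blocks` is hypothetical and $n$ is assumed to be a multiple of $V$.

```python
import random


def stratified_vfold_blocks(xs, V, seed=0):
    """Split indices 0..n-1 into V blocks so that each block contains exactly
    one index from every group of V consecutive points in sorted order of xs."""
    n = len(xs)
    if n % V != 0:
        raise ValueError("this sketch assumes that n is a multiple of V")
    rng = random.Random(seed)
    # order[i] plays the role of tau(i + 1): indices sorted by design point
    order = sorted(range(n), key=xs.__getitem__)
    blocks = [[] for _ in range(V)]
    for k in range(n // V):
        group = order[k * V:(k + 1) * V]
        rng.shuffle(group)  # random assignment of the group to the V blocks
        for j, idx in enumerate(group):
            blocks[j].append(idx)
    return blocks
```

With $V = 2$, taking the first block as training set gives a subset $I$ of size $n/2$ containing exactly one index of each pair of consecutive sorted design points, which is the constraint stated for the hold-out procedures.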

Table 2: Short names for the procedures compared.

A: Epenid     E: 2-FCV      I: pen2-F
B: MalEst     F: 5-FCV      J: pen5-F
C: MalMax     G: 10-FCV     K: pen10-F
D: HO         H: penHO      L: penLoo

Furthermore, given each penalization procedure among the above (let us call it 'Pen'), we consider the associated ideally calibrated penalization procedure 'IdPen', which is defined as follows:

$$\widehat{m}^{\star}_{\mathrm{pen}} := \widehat{m}_{\mathrm{pen}}(K^{\star}_{\mathrm{pen}}) \quad \text{where} \quad \forall K \geq 0 , \ \widehat{m}_{\mathrm{pen}}(K) \in \operatorname{argmin}_{m \in \mathcal{M}_n} \{ P_n\gamma(\widehat{s}_m) + K\, \mathrm{pen}(m) \}$$

$$\text{and} \quad K^{\star}_{\mathrm{pen}} \in \operatorname{argmin}_{K \geq 0} \{ \ell(s, \widehat{s}_{\widehat{m}_{\mathrm{pen}}(K)}) \} .$$

In other words, $\mathrm{pen}(m)$ is used with the best distribution and data-dependent overpenalization factor $K^{\star}_{\mathrm{pen}}$. Needless to say, 'IdPen' makes use of the knowledge of $P$, and is only considered for experimental comparison.

When $\mathrm{pen}(m) = D_m$, the above definition defines the ideal linear penalization procedure, that we call 'IdLin' (and the selected model is denoted by $\widehat{m}^{\star}_{\mathrm{lin}}$). In addition, we consider the ideal dimensionality-based model selection procedure 'IdDim', defined by (9).

Finally, let us specify that in all the experiments, prior to performing any model selection procedure, models $S_m$ such that $\min_{\lambda \in \Lambda_m} \mathrm{Card}\{i \text{ s.t. } X_i \in \lambda\} < 2$ are removed from $(S_m)_{m \in \mathcal{M}_n}$. Without removing interesting models, this preliminary step intends to provide a fair and clear comparison between penLoo (which was defined in [6] including this preliminary step) and other procedures.

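The benchmark for comparing model selection performances of the procedures is

$$C_{\mathrm{or}} := \frac{\mathbb{E}[\ell(s, \widehat{s}_{\widehat{m}})]}{\mathbb{E}[\inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m)]} , \qquad (16)$$

where both expectations are approximated by an average over the $N$ simulated samples. Basically, $C_{\mathrm{or}}$ is the constant that should appear in an oracle inequality (2) holding in expectation with $R_n = 0$. We also report the following uncertainty measure of our estimator of $C_{\mathrm{or}}$,

$$\varepsilon_{C_{\mathrm{or}},N} := \frac{\sqrt{\mathrm{var}(\ell(s, \widehat{s}_{\widehat{m}}))}}{\sqrt{N}\, \mathbb{E}[\inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m)]} , \qquad (17)$$

where var (resp. $\mathbb{E}$) is approximated by an empirical variance (resp. expectation) over the $N$ simulated samples.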
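
The benchmark and its uncertainty measure can be estimated from simulation output as sketched below. This is an illustration under stated assumptions: the inputs are the per-sample excess losses of the selected model and of the oracle, the function name `accuracy_index` is hypothetical, and the unbiased sample variance is used as the "empirical variance" (the source does not say which normalization was used).

```python
from math import sqrt
from statistics import fmean, variance


def accuracy_index(selected_losses, oracle_losses):
    """Estimate C_or (ratio of mean losses) and its uncertainty measure from
    N simulated samples.

    selected_losses[k]: excess loss of the selected model on sample k.
    oracle_losses[k]:   smallest excess loss over the collection on sample k.
    """
    n_samples = len(selected_losses)
    mean_oracle = fmean(oracle_losses)
    c_or = fmean(selected_losses) / mean_oracle
    # sample standard deviation, divided by sqrt(N) times the mean oracle loss
    eps = sqrt(variance(selected_losses)) / (sqrt(n_samples) * mean_oracle)
    return c_or, eps
```

Note that the ratio is taken between the two averages, not averaged sample by sample, which matches an oracle inequality holding in expectation.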

5.3 Results 

The (evaluated) values of $C_{\mathrm{or}} \pm \varepsilon_{C_{\mathrm{or}},N}$ in the four experiments are given on Figures 5 and 6 (for procedures A-L) and in Table 3 (for IdDim and the ideally calibrated penalization procedures). In addition, results for experiment 'X1-005' with various values of the sample size $n$ are presented on Figure 7. Note that a few additional results are provided in Appendix A.

Figure 5: Accuracy indices $C_{\mathrm{or}}$ in Experiment X1-005 for various model selection procedures. Error bars represent $\varepsilon_{C_{\mathrm{or}},N}$. $C_{\mathrm{or}}$ is defined by (16) and $\varepsilon_{C_{\mathrm{or}},N}$ by (17). Red circles show which are the most accurate procedures, taking uncertainty into account and excluding procedures using $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}]$. On the x-axis (A1, A2, A4, B1, ..., L4), procedures are named with a letter (whose meaning is given in Table 2), plus a figure (for penalization procedures) equal to the value of the overpenalization constant $C_{\mathrm{ov}}$; for instance, J2 means the 5-fold penalty multiplied by 2. See also the additional table and Figures 9-10 in Appendix A.



Table 3: Accuracy indices $C_{\mathrm{or}}$ for "ideal" procedures in four experiments, $\pm \varepsilon_{C_{\mathrm{or}},N}$. See also Figures 11-12 in Appendix A.

Experiment    X1-005           S0-1             XS1-05           X1-005µ02
IdLin         2.065 ± 0.010    2.106 ± 0.009    1.308 ± 0.002    2.211 ± 0.009
IdDim         1.507 ± 0.009    1.595 ± 0.008    1.262 ± 0.002    1.683 ± 0.008
IdPenHO       2.158 ± 0.020    1.785 ± 0.012    1.509 ± 0.005    1.767 ± 0.012
IdPen2F       1.454 ± 0.011    1.523 ± 0.009    1.303 ± 0.003    1.410 ± 0.008
IdPen5F       1.377 ± 0.008    1.467 ± 0.008    1.244 ± 0.002    1.413 ± 0.007
IdPen10F      1.384 ± 0.008    1.446 ± 0.008    1.240 ± 0.002    1.414 ± 0.007
IdPenLoo      1.378 ± 0.008    1.458 ± 0.008    1.233 ± 0.002    1.419 ± 0.007
IdEpenid      1.363 ± 0.008    1.401 ± 0.008    1.232 ± 0.002    1.410 ± 0.007

Figure 6: Same as Figure 5 with the three other experiments (panels: Experiment X1-005µ02, Experiment S0-1, Experiment XS1-05). See also the additional table and Figures 9-10 in Appendix A.

The first remark we can make from Figures 5 and 6 is that changing the overpenalization
factor Cov can make large differences in 

Cor ) was pointed in [S]. We do not address here the 
question of choosing Cov from data since it is beyond the scope of the paper; see Section 11.3.3 
of [1] for suggestions. 

The choice of the overpenalization factor can be put aside in two ways. First, we can compare on Figures 5 and 6 the performances obtained with the best deterministic value of $C_{\mathrm{ov}}$ for each procedure. Second, we can compare procedures with their best data-driven (and distribution dependent) overpenalization factor, as done in Table 3. Both ways yield the same qualitative conclusion in the four experiments (up to minor variations), which can be summarized as follows.

Firstly, most resampling-based procedures (that is, VFCV, penVF, penLoo) outperform dimensionality-based procedures (MalEst, which strongly overfits, and even MalMax), which confirms the theoretical results of Sections 3 and 4. In particular, penLoo yields a significant improvement over MalEst and MalMax (by more than 25% in three experiments, and by 2.3% in XS1-05), and IdPenLoo similarly outperforms IdLin and IdDim (by more than 8.5% in three experiments, and by 2.3% in XS1-05). Even penLoo with a well-chosen deterministic overpenalization factor $C_{\mathrm{ov}}$ outperforms IdLin by 7% to 18% in experiments X1-005, S0-1 and X1-005µ02, and penLoo equals the performances of IdLin in experiment XS1-05 (compare Table 3 with the additional table in Appendix A). Figure 7 illustrates the same phenomenon: when the sample size increases, the model selection performance of IdLin remains approximately constant (close to 2) while the model selection performance $C_{\mathrm{or}}$ of penLoo constantly decreases (with $C_{\mathrm{ov}} = 1.25$ because overpenalization is still needed for $n = 3\,000$ and we could not consider larger sample sizes for computational reasons). The reasons for this clear advantage of resampling-based procedures are the same in the four experiments: As pointed out in Section 4.5 for experiment X1-005, no dimensionality-based model selection procedure can select a model close enough to the oracle $m^{\star}$. In particular, figures similar to Figure 3 hold in experiments S0-1, X1-005µ02, and XS1-05, see Figure 13 in Appendix A.

Secondly, improving over dimensionality-based procedures requires a significant increase of the computational cost. Indeed, penHO performs significantly worse than MalMax in experiments X1-005 and XS1-05, while penHO and MalMax have similar performances in experiments S0-1 and X1-005µ02. Furthermore, IdPenHO performs worse than IdDim in the four experiments, and even worse than IdLin in experiments X1-005 and XS1-05. In order to obtain noticeably better performances than dimensionality-based procedures, our experiments show that the computational cost must at least be increased to the one of 5-fold penalization. This phenomenon certainly comes from the small signal-to-noise ratio, which makes it difficult to estimate precisely the penalty shape by resampling, whereas MalMax can provide reasonably good performances thanks to underfitting.

Finally, let us add that penVF outperforms VFCV (and similarly penHO outperforms HO), 
provided the (deterministic) overpenalization factor is well-chosen, as shown in a previous paper

6 Conclusion: How to choose the penalty? 

Combining the theoretical results of Sections 3 and 4 with the conclusions of the experiments of Section 5, we can propose an answer to the main question raised in this paper: Which penalty should be used for which model selection problem? A visual summary of this conclusion is proposed on Figure 8.

Three main factors must be taken into account to answer this question: 

Figure 7: Model selection performance $C_{\mathrm{or}}$ of IdLin and penLoo (multiplied by $C_{\mathrm{ov}} = 1.25$) as a function of $n$ for experiment X1-005. Each estimated value of $C_{\mathrm{or}}$ is obtained with $N = 1\,000$ data samples for $n \leq 500$ and $N = 200$ data samples for $n \geq 1\,000$ for computational reasons. Error bars represent $\varepsilon_{C_{\mathrm{or}},N}$.

Figure 8: Summary of Section 6: best penalty as a function of the computational power available, the level of accuracy needed, and the knowledge about heteroscedasticity of data. The limit between linear and resampling penalties goes up when the SNR goes down.



1. the prior knowledge on the noise-level $\sigma(\cdot)$,

2. the trade-off between computational power and statistical accuracy desired,

3. the signal-to-noise ratio (SNR).

What is known about $\sigma(\cdot)$ appears as the determining factor:

(a) If $\sigma(\cdot)$ is known to be constant, $C_p$ clearly is the best procedure, compared to cross-validation or resampling-based procedures which cannot take into account this information about data.

(d) If $\sigma(\cdot)$ is non-constant but completely known, then the expectation of the ideal penalty $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$ is entirely known and should be used, following the unbiased risk estimation principle.

Note that Cp or AIC are still often used in case (d), mainly by non-statisticians that probably 
do not know (or do not trust) model selection procedures adapting to heteroscedasticity. This 
paper provides clear theoretical arguments to show them what improvement they could obtain 
by using a properly chosen procedure. 

Choosing a penalty is less simple in the following two intermediate cases, where a trade-off must be found between the precise knowledge on $\sigma$, computational power and statistical accuracy:

(b) $\sigma(\cdot)$ is probably (almost) constant, but this information is questionable,

(c) $\sigma(\cdot)$ is known to be non-constant, without prior information on its shape.

If the computational power has no limits (or, equivalently, if accuracy is crucial), resampling penalties (or $V$-fold penalties with $V$ large) should be used in both cases.

Nevertheless, when the computational power is limited, one has to take into account that $V$-fold penalties with too small $V$ poorly estimate the shape of the penalty, so that they may be outperformed by MalMax (that is, $C_p$ with $\sigma^2$ replaced by some upper bound on $\|\sigma\|_\infty^2$), in particular in case (b). Similarly, if the final user does not mind losing a (small) constant in the quadratic risk, MalMax could be used instead of resampling in case (b), and even in case (c) provided $(\max \sigma / \min \sigma)^2$ is small enough. Of course, using MalMax requires either the knowledge of an upper bound on $\|\sigma\|_\infty^2$, or to be able to estimate one (for instance assuming $\sigma(\cdot)$ does not jump too much).

The picture can also change depending on the SNR. When the SNR is small, overpenalization is usually required. Therefore, choosing a proper overpenalization level can be more important than estimating the shape of the penalty, so that MalMax (possibly with an enlarged penalty) is quite a reasonable choice in case (b), and even in case (c) depending on the computational power. On the contrary, when the SNR is large, $V$-fold penalties (even with rather small $V$, such as $V = 5$) yield a significant improvement over any dimensionality-based penalty.

A natural question arises from this conclusion: How to calibrate precisely the constant in front of the penalty? Birgé and Massart [13] proposed an optimal (and computationally cheap) data-driven procedure answering this question, based upon the concept of minimal penalties (see [2] for the heteroscedastic regression framework). Nevertheless, theoretical results on Birgé and Massart's procedure are not accurate enough to determine whether it takes into account the need for overpenalization when the SNR is small. Therefore, understanding precisely how we should overpenalize as a function of the SNR seems a quite important question from the practical point of view, which is still widely open, to the best of our knowledge.

7 Proofs 

Before proving Theorem [2] and Propositions [2] and [3l let us define some notation and recall 
probabilistic results from other papers El E] that are used in the proof. 

7.1 Notation 

In the rest of the paper, $L$ denotes an absolute constant, not necessarily the same at each occurrence. When $L$ is not universal, but depends on $p_1, \ldots, p_k$, it is written $L_{p_1, \ldots, p_k}$. Define, for every model $m \in \mathcal{M}_n$,

$$p_1(m) := P\gamma(\widehat{s}_m) - P\gamma(s_m) , \qquad p_2(m) := P_n\gamma(s_m) - P_n\gamma(\widehat{s}_m) \qquad \text{and}$$

$$\overline{\delta}(m) := (P_n - P)(\gamma(s_m) - \gamma(s)) .$$

7.2 Probabilistic tools: expectations

Proposition 4 (Proposition 1 and Lemma 7 in ^). Let Sm be the model of histograms associated 
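with the partition $\Lambda_m$. Then,

$$\mathbb{E}[p_1(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} (1 + \delta_{n,p_\lambda})\, \sigma_\lambda^2 \qquad (18)$$

$$\mathbb{E}[p_2(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \sigma_\lambda^2 \qquad (19)$$

where $\sigma_\lambda^2 := \mathbb{E}[(Y - s_m(X))^2 \mid X \in \lambda]$, $p_\lambda = \mathbb{P}(X \in \lambda)$ and $\delta_{n,p}$ only depends on $(n, p)$. Moreover, $\delta_{n,p}$ is small when the product $np$ is large:

$$|\delta_{n,p}| \leq \min\{ L_1, L_2 (np)^{-1/4} \} ,$$

where $L_1$ and $L_2$ are absolute constants.

Note that $\delta_{n,p}$ can be made explicit: $\delta_{n,p} = np\, \mathbb{E}[Z^{-1} \mathbb{1}_{Z > 0}] - 1$ where $Z$ is a binomial random variable with parameters $(n, p)$.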
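
The explicit expression of $\delta_{n,p}$ can be evaluated exactly by summing over the binomial distribution, which gives a quick way to see how fast it vanishes when $np$ grows. The sketch below is illustrative; the function name `delta_np` is hypothetical, and the probability mass function is computed on the log scale only for numerical stability.

```python
from math import exp, lgamma, log


def delta_np(n, p):
    """delta_{n,p} = n * p * E[ 1{Z > 0} / Z ] - 1 for Z ~ Binomial(n, p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("this sketch assumes 0 < p < 1")
    inv_moment = 0.0
    for k in range(1, n + 1):
        # log of the binomial probability P(Z = k)
        log_pmf = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p)
        )
        inv_moment += exp(log_pmf) / k
    return n * p * inv_moment - 1.0
```

For instance, with $n = 1$ the formula reduces to $p^2 - 1$, while for $np$ of order $10^3$ the value is already close to zero.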

Remark 4. The regressogram estimator $\widehat{s}_m$ is not defined when $\mathrm{Card}\{i \text{ s.t. } X_i \in \lambda\} = 0$ for some $\lambda \in \Lambda_m$, which occurs with positive probability. Therefore, a convention for $p_1(m)$ has to be chosen on this event (which has a small probability, see Claim 1 in the proof of Theorem 2) so that $p_1(m)$ has a finite expectation (see [5] for details). This convention is purely formal, since the statement of Theorem 2 does not involve the expectation of $p_1(m)$. The important point is that the same convention is used in Proposition 5 below.

7.3 Probabilistic tools: concentration inequalities 

We state in this section some concentration results on the components of the ideal penalty, using for $p_1(m)$ the same convention as in Proposition 4.

Proposition 5 (Proposition 10 in [6j, proved in Section 4 of [7]). Let 7 > 0. Assume that 
minA6A,„ {"-Pa} > > 1, \\Y\\^ < A < co and 

E [(T{Xf I X G A] > Q > . 

AeAm 

Then, an event of probability at least 1 — Ln~'^ exists on which 

pi{m) > E [pi{m)] - LA,Qa [(Inn)' D^^'^ + e"^^"] E [p2{m)] (20) 
pi{m) <E[pi{m)]+LA,Q,^ [(lnn)'D-i/2 + y^e-^^" ] E [p2(m) ] (21) 

|p2(m)-E[p2(m)]| <LA,Q,7^™/'ln(n)E[p2(m)] . (22) 

Lemma 6 (Proposition 8 in [9j). Assume that \\Y\\^ < A < 00. Then for any x > 0, an event 
of probability at least 1 — 2e~^ exists on which 



- '4 8\ A^x 

V7?>0, |(^(m)| < 7/£(s,s„) + ( - + - 

\r] 6 J n 



(23) 
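
Lemma 7 (Lemma 12 in [5]). Let $(p_\lambda)_{\lambda \in \Lambda_m}$ be non-negative real numbers of sum 1, $(n\widehat{p}_\lambda)_{\lambda \in \Lambda_m}$ a multinomial vector of parameters $(n; (p_\lambda)_{\lambda \in \Lambda_m})$, and $\gamma > 0$. Assume that $\mathrm{Card}(\Lambda_m) \leq n$ and $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \} \geq B_n > 0$. Then, an event of probability at least $1 - Ln^{-\gamma}$ exists on which

$$\min_{\lambda \in \Lambda_m} \{ n \widehat{p}_\lambda \} \geq \frac{\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}}{2} - 2(\gamma + 1) \ln n . \qquad (24)$$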

7.4 Proof of Theorem [2] 

In the following, ^fHThm~i = ^_Eo- o-i llo-'ll II(t"II denotes any constant depending on Oa-, crt,, 
||(t'||^ and only. The outline of the proof of Theorem [2] is the following: 

• From Lemma [U it is sufficient to prove that for every 

inf {^(s,Sm)}>Ci inf {^(s,Sm)} 

meA^dim(-D) mgXn 

holds with probability at least 1 — K\n^'^ , with Ci > 1 . 

• Prove that all regressogram estimators are well-defined with a large probability (Claim 1).

• Compute explicitly the bias of each model (Claim 2).

• Provide a good approximation of the excess loss and the empirical risk of each model on a large probability event (Claim 3).

• Upper bound the excess loss of the oracle model (Claim 4).

• Lower bound the excess loss of small models (Claim 5).

• Prove that all models $m$ having an excess loss close to the one of the oracle are close to the oracle model (Claim 6).

• Conclude the proof by showing that for every model close to the oracle, a model with a smaller empirical risk can be found.

As pointed out by Remark 4, the regressogram estimator associated with model $S_m$ is not well defined if for some $\lambda \in \Lambda_m$, no $X_i$ belongs to $\lambda$. The following claim shows this only happens with a small probability, hence this possible problem can be put aside for proving Theorem 2.

Claim 1. An event Vl\ of probability at least 1 — Ln~'^ exists on which 

ym £ Mn & Am , Card{i s.t. Xi G X} > 1 . 
Hence, on ^li, all estimators (sm)meA^n '^^^ well-defined. 

proof of ClaimU^ For every m G Ain , let us apply Lemma [7| with Bn = (ln(n))^ and 7 = 4. 
An event of probability at least 1 — Ln~'^ exists such that 

min Card{i s.t. G A} > ^ ^ " - lOlnn . 
AeAm 2 

This lower bound is positive provided that n > L. Therefore, the result holds on the intersection 
Oi of these Card(A^„) < events. □ 



25 



The next step is to use the results recalled in Sections 17.21 and 17.31 in order to control the 
excess loss and the empirical risk of each model. This leads to Claims [2] and E] below. 



Claim 2. Define 



ai 



/ ^ J 



48 



1 



(s'(x)) dx and = 7^ / dx 

48 



1/2 



For every m S Ain , some Km,b,i, t^m,b,2 S M exist such that 



(25) 



and l^^^^l < L\\^f\\ D. 



M 00 ' II 1 1 00 



-1 



proof of Claim\^ Since X is uniformly distributed on [0, 1], 



t{s,Sr. 



f {s^{X) - s{X)f = [ {s{ 



x) — s\ )^ dx 



(26) 



where s\ is the average of s on A . We now fix some A G . Let c\ denote the center of the 
interval A , and |A| the length of A . Then, 



{s{x) - sxf dx = {s{cx) - sxf + I {s{x) - s{c\)f' dx 



(27) 

In addition, since s is twice continuously differentiable, for every x G A, some g{x) G A exists 
such that 



1 



s(x) - s(ca) = (x - cx)s'{cx) + -{x - cx) s"{g{x)) 



On the one hand, integrating ([28]) over A leads to 



{sx-sicx)f<L\\s"\\l\X\' . 



On the other hand, integrating the square of ([25j) over A leads to 

^"(ca)|A|^ 



{s{x) - s{cx)) dx 



12 



<L|AnU"|| (\\s'\\ +\\s"\\ ) 

— I I M Moo\M Moo M Moo/ 



Combining ()27p with (|29p and ()30p then shows that for every A G A^, 

„/2/„. ^ I \ |3 



J (six) - sx) 



,,^._."(e.)|Ar 



12 



<L|Ar|U"ll (iis'ii ) 

— I I M MooVM Moo M Moo/ 



(28) 
(29) 

(30) 
(31) 



Furthermore, for every A G A^ 



\M{s'icx)f- / {s'{x)fdx 



< 



{s'{x)Y -{s'{cx)f dx<2||/||^ / \s'{x)-s'{cx)\dx 



<2|U'|| |Ap . (32) 

— M MooM Moo'i ^ / 



Using (j26p . combining (j32p with (j3ip and summing over A G A^ implies 



^(s,s^ 



(lAp/ 



<L||s"|| (lls'll ) y |A|^ 

— M Moo \ 11 Moo M Moo/ / ^ ' ' 

AeAm 

<L||s"|| (lU'll )(d-^\+D-^\ 

— M Moo \ II Moo M Moo/ V "ijJ- 



and the result follows. 



□ 
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Claim 3. Define f3i = 2(cja)^ and f32 = 2((Tf,)^ . An event Q, of probability at least 1 — Ln 
exists on which for every m = {Dm,i, Dm,2) £ -Mn , 



( SjYl ) 

and Pnl (sm) - P7 (s) 



a 



+ 



m,2 



n 



n 



( 1 + Km,0 ) 



(33) 



d2 n2 



m,2 



(1 + 



m,l 



n 



n 



[ 1 + Km,2 ) , 

(34) 

where (Km,i)j=o,i,2,me>!„ sate/y 

maxj |Km,j| s.t i G {0, 1,2} , m G A^„, and min { D^.i, -0^,2 } > (lnn)^| < L(HThm) (Inn)^"*^/^ 

proof of Claim\^ Using the notation of Section [7.11 

^{s,Sm) = (-{s^Sm) +Pi{in) and Pn^t {sm) - Pi {s) = (. {s, Sm) - P2{'m) +'5{m) . 

Let us first compute the expectation of each term. Recall that E[(5(m)] = 0. The bias ^ (s, s^) 
is controlled thanks to Claim [2j By Proposition U E [pi(?7i)] and E [p2{m)] mostly depend on 



al=^ {Y-s{X)y 



X £X 



E 



{six)-s^ix)y 



X £ \ 



+ E 



X £ X 



Leb(A) 



2 1 

(s(x) - Srn{x)) dx + 



Leb(A) 



{a{X)Ydx . 



Precisely, 



Eb2(m)] = i 

2n 1 fl/S r 



n 



[s{X)-Sm{x)f + {a{x)f 



dx + 



2D 



m,2 



n 



1/2 



{S{X) - Sm{x) f + {(J{X))- 



n 



n 



where < R{m,n) = - ( Dm,i [ {s{X) - Smix))"^ dx + D.m.2 [ - Sm{x)f dx] < 

n \ Jo ' Ji/2 J 



^ ( '5, SjYi ) 

(ln(n)) 



2 ' 



since -Dm,j < n/(2(ln(n))^) . Similarly, 

E[pi(m)] = 

where < R'{m,n) < 



PlP>m,l , P2P>m,l \ n I X ^ I D'r ^ 
H \ [l + On) + R {nn,n) 



n 



n 



s , 5^ 



(ln(n))^ 



and \5n\ <L(lnn) 



It now remains to prove that pi{m) — E[pi(m)] and E[p2(fw)] ~ P2{'m) + 5{m) are close 
to zero on a large probability event. The condition on a{-) imply that the last assumption of 
Proposition [5] holds since 

D-J E E [a{Xf I X E A] = ^D„.M' + ^D^A-^f | g > q . 
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Let be the event on which, for every m £ Ain i (|20p ~(j22 p hold with 7 = 4 and i?„ = ln(n)^, 
and (|23|) holds with x = 41n(n) and rj = (ln(n))^^ . Since Card(7W„) < n^, Proposition and 
Lemma [6] show that F (0,2) > 1 — Ln~^. Therefore, the probability of = ili n is larger 
than 1 — Kin^'^ for some absolute constant Ki . 

On Q,, for every m G A4n such that miii{Dm,i, Dm,2} ^ (Inl'^))^) we then have 



\pi{m) -E[pi(m)]| < L(HThm) (Inn)"^ E [P2M] 

\p2{m) - E[p2{m)]\ < L(HThm) (lnn)"^E[p2M] 

J IT^ M^Hs^Sm) , -^^(HThm) (Inn)^ 

and 0(771) < — 1 ^ 

' ' Inn n 

as soon as n > L . Enlarging the constant Ki so that 1 — Kin~'^ < when n is too small yields 
the result. □ 



Claim 4. On Q, 

inf {i(s,Srn)}<-J^(ay'f^l^' + al/'f^2^')(l + L^^^^ . (35) 



Proof of Claim 4. Let $m^* \in \mathcal{M}_n$ be any model such that



D 



2ain ^ 



< 1 and 



m*,2 



2a2f^ ^ 



< 1 



As soon as n > i^(HThm)i such an exists and satisfies minjDm,,!, Dm^,2} > (lnn)^ The 
result follows from Claim [31 □ 

Claim 5. For every m G Ain such that min{ Dm^i, L'm^2 } < 



(ln(?7-) 



\12 



In particular, Claims H] and [5] show that for every Ci > , when min | D^',!) } ^ 
(ln(n))^ and n > L(HThm),Ci ' 

Hs,Sm')>Ci inf {£(s,Sm)} ■ 
m&Mn 

Proof of Claim 5. First, note that $\ell(s,\widehat{s}_m) \ge \ell(s,s_m)$, and by Claim 2, for every $m \in \mathcal{M}_n$,



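If $m \in \mathcal{M}_n$ satisfies $\min\{D_{m,1}, D_{m,2}\} \ge L_1(\|s'\|_\infty, \|s''\|_\infty)$, then the lower bound is larger
than

\[
\frac{\alpha_1}{2 D_{m,1}^2} + \frac{\alpha_2}{2 D_{m,2}^2} \ge \frac{\min\{\alpha_1, \alpha_2\}}{2 \min\{D_{m,1}, D_{m,2}\}^2} .
\]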

Now fix some m G Ain such that min { Dm,i, Dm,2 } ^ (ln(n) )^. Some m' G Ain exists such that 
Sm C Sm', Li < Dm',1 < 2max|Li, (ln(n))^| ; indeed, either m = m' satisfies the condition. 
or m' can be obtained from m by doubling the number of bins in [0, 1/2] (resp. (1/2, 1]) until
the required condition is fulfilled. Then, 



p( \->p( \ ^ min{ai,a2} ^ 



M OO ' II 1 1 OO 



2mm{Dm,i,Drn,2} (ln(n)) 



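Claim 6. Define for every $m \in \mathcal{M}_n$,

\[
C_{m,1} := D_{m,1} \left( \frac{\beta_1}{2 \alpha_1 n} \right)^{1/3} \quad \text{and} \quad C_{m,2} := D_{m,2} \left( \frac{\beta_2}{2 \alpha_2 n} \right)^{1/3} .
\]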

Let A G (0, 1] and define 

??A := - 



□ 



2/3 



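Then, on $\Omega$, any $m \in \mathcal{M}_n$ such that

\[
\ell(s,\widehat{s}_m) \le (1 + \eta_A) \inf_{m' \in \mathcal{M}_n} \left\{ \ell(s,\widehat{s}_{m'}) \right\} \tag{36}
\]

must satisfy

\[
\max\left\{ |C_{m,1} - 1| , |C_{m,2} - 1| \right\} \le A \tag{37}
\]

as soon as $n \ge L_{(\mathrm{HThm}),A}$.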

Proof of Claim 6. Assume that $\Omega$ holds and let $m \in \mathcal{M}_n$ satisfy (36). From Claim 5, we
know that va.va.{Dm,i,Dm,2} > (ln(n))^ for n > i(HThm),A- 

Define, for every $x > -1$, $f(x) = 2^{-2/3}(1+x)^{-2} + 2^{1/3}(1+x)$ and for every $x > 0$,
$g(x) = \min\{1, (x-1)^2\}$. Then, (33) in Claim 3 and Lemma 8 below yield



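\[
\ell(s,\widehat{s}_m) \ge n^{-2/3} \left( \alpha_1^{1/3} \beta_1^{2/3} f(C_{m,1} - 1) + \alpha_2^{1/3} \beta_2^{2/3} f(C_{m,2} - 1) \right) \left( 1 - L_{(\mathrm{HThm})} (\ln n)^{-1/2} \right)
\]
\[
\ge \left[ \frac{3}{2^{2/3} n^{2/3}} \left( \alpha_1^{1/3} \beta_1^{2/3} + \alpha_2^{1/3} \beta_2^{2/3} \right) + \frac{3}{2^{14/3} n^{2/3}} \left( \alpha_1^{1/3} \beta_1^{2/3} g(C_{m,1}) + \alpha_2^{1/3} \beta_2^{2/3} g(C_{m,2}) \right) \right] \left( 1 - L_{(\mathrm{HThm})} (\ln n)^{-1/2} \right) .
\]

Hence, (35) and (36) imply

\[
16 \left( \alpha_1^{1/3} \beta_1^{2/3} + \alpha_2^{1/3} \beta_2^{2/3} \right) \frac{\eta_A + L_{(\mathrm{HThm}),A} (\ln n)^{-1/2}}{1 - L_{(\mathrm{HThm})} (\ln n)^{-1/2}} \ge \alpha_1^{1/3} \beta_1^{2/3} g(C_{m,1}) + \alpha_2^{1/3} \beta_2^{2/3} g(C_{m,2}) .
\]

In particular, when $n \ge L_{(\mathrm{HThm}),A}$,

\[
g(C_{m,1}) \le 16 A^2 / 17 < 1 \quad \text{and} \quad g(C_{m,2}) \le 16 A^2 / 17 < 1 ,
\]

which implies (37). □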
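The change of variables behind Claim 6 can be checked numerically. The following is a minimal sketch, assuming the rescaling $C = D (\beta / (2 \alpha n))^{1/3}$ and the function $f$ defined in the proof: it verifies that $\alpha / D^2 + \beta D / n = n^{-2/3} \alpha^{1/3} \beta^{2/3} f(C - 1)$. The function names and the numerical values of `alpha`, `beta`, `n` and `D` are illustrative assumptions and do not come from the paper.

```python
import math


def f(x):
    """f(x) = 2^(-2/3) (1 + x)^(-2) + 2^(1/3) (1 + x), for x > -1."""
    return 2 ** (-2 / 3) * (1 + x) ** (-2) + 2 ** (1 / 3) * (1 + x)


def bias_plus_variance(alpha, beta, n, D):
    """Approximate risk alpha / D^2 + beta * D / n on one half of [0, 1]."""
    return alpha / D ** 2 + beta * D / n


def rescaled(alpha, beta, n, D):
    """Same quantity written as n^(-2/3) alpha^(1/3) beta^(2/3) f(C - 1)."""
    C = D * (beta / (2 * alpha * n)) ** (1 / 3)
    return n ** (-2 / 3) * alpha ** (1 / 3) * beta ** (2 / 3) * f(C - 1)


# Hypothetical parameter values, chosen only for illustration.
for alpha, beta, n, D in [(1.0, 1.0, 1000, 5), (0.3, 2.5, 10_000, 40), (2.0, 0.01, 500, 3)]:
    lhs = bias_plus_variance(alpha, beta, n, D)
    rhs = rescaled(alpha, beta, n, D)
    assert math.isclose(lhs, rhs, rel_tol=1e-9)
```

Since $f$ is minimal at $0$, the identity also shows why values of $C_{m,i}$ close to 1 correspond to nearly optimal numbers of bins on each half.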

Lemma 8. Let $f : (-1, +\infty) \to \mathbb{R}$ be defined by $f(x) = 2^{-2/3}(1+x)^{-2} + 2^{1/3}(1+x)$. Then, for
every $x > -1$,

\[
f(x) \ge 3 \times 2^{-2/3} + 3 \times 2^{-14/3} \min\left\{ x^2 , 1 \right\} .
\]

Proof of Lemma 8. We apply the Taylor-Lagrange theorem to $f$ (which is infinitely differentiable)
at order two, between $0$ and $x$. The result follows since $f(0) = 3 \times 2^{-2/3}$, $f'(0) = 0$ and
$f''(t) = 6 \times 2^{-2/3} \times (1+t)^{-4} \ge 3 \times 2^{1/3 - 4}$ if $t \le 1$. If $x > 1$, the result follows from the fact that
$f' \ge 0$ on $[0, +\infty)$. □
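The inequality of Lemma 8 can be sanity-checked on a grid. The sketch below is only a numerical illustration, not a proof: it assumes the constants $3 \times 2^{-2/3}$ and $3 \times 2^{-14/3}$ of the lemma, and the grid bounds and the function names are arbitrary choices made here.

```python
def f(x):
    """f(x) = 2^(-2/3) (1 + x)^(-2) + 2^(1/3) (1 + x), for x > -1."""
    return 2 ** (-2 / 3) * (1 + x) ** (-2) + 2 ** (1 / 3) * (1 + x)


def lemma8_lower_bound(x):
    """Right-hand side of Lemma 8: 3 * 2^(-2/3) + 3 * 2^(-14/3) * min(x^2, 1)."""
    return 3 * 2 ** (-2 / 3) + 3 * 2 ** (-14 / 3) * min(x * x, 1.0)


def smallest_gap(num_points=20_000, x_min=-0.999, x_max=50.0):
    """Smallest value of f(x) - lower_bound(x) over a regular grid of (x_min, x_max)."""
    step = (x_max - x_min) / num_points
    return min(
        f(x_min + i * step) - lemma8_lower_bound(x_min + i * step)
        for i in range(num_points + 1)
    )


# The gap should be nonnegative up to floating-point rounding (it vanishes at x = 0).
assert smallest_gap() >= -1e-12
```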

We can now conclude the proof of Theorem 2. Let us assume that on $\Omega$, a model $m$ satisfies
(36) for some $A > 0$ to be chosen later. Without loss of generality, we can assume that $\sigma_a > \sigma_b$,
hence $\beta_1 > \beta_2$. By Claim 6, we have



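\[
\left( \frac{2 \alpha_1 n}{\beta_1} \right)^{1/3} (1 - A) \le D_{m,1} \le \left( \frac{2 \alpha_1 n}{\beta_1} \right)^{1/3} (1 + A)
\]

and

\[
\left( \frac{2 \alpha_2 n}{\beta_2} \right)^{1/3} (1 - A) \le D_{m,2} \le \left( \frac{2 \alpha_2 n}{\beta_2} \right)^{1/3} (1 + A) .
\]

Therefore,

\[
D_{m,1} \le k D_{m,2} \quad \text{with} \quad k := \frac{\alpha_1^{1/3} \beta_2^{1/3} (1 + A)}{\alpha_2^{1/3} \beta_1^{1/3} (1 - A)} .
\]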
Since /3i > /32, we can choose A = A(/3i,/32) > such that 

k<k' = (ai/a2)'/^(/32//3i)'/*^ <i^= (aiM)'/^ • 

Note that Dm,i < i^Dm,2 is equivalent to Z)m,i ^ KDm/i^ + k). Therefore, some m' £ A4n exists 
such that Dm = D^' and —1 < -Dm',1 — -Cm'^/(1 + k) < 0. Then, implies that 



ai 



+ 



02 



d2 n2 



m,2 



, 1 + '^m,l 



ai 



+ 



Q2 



m,2 



n 



n 



(1 + Km,2) + 



1 + 



A-Pm'.l _|_ 1^2 Dm', 2 



n 



n 



> 



+ 



+ 



L 



n 



n 



(HThm) 



(ln(n); 



,1 + i^m',2) 



1/2^-2/3 



Now, remark that the bias term is smaller for 5"^' than for Sm since x i— )• aix ^ + 02 (-D^ — x) ^ 
is decreasing on (0, Z)mK/(l + k)]. Therefore, using the definition of m', 

JD \ D N . /3l(-C'm',l - 1) P2{Dm' ,2 - Dm,2) j , nn-1/2 -2/3 

^'nTl'SmJ -i^nTlSm') > ' \ — ' -^(HThm) ( 1" W ) « ' 



n 



n 



Wl-f52){Dm',l-Dm,l) 



L 



n 



(HThm) 



(ln(n) 



,-1/2^-2/3 



> 



Wl - f32) 



D„ 



L 



n 



(HThm) 



(ln(n) 



-1/2^-2/3 



> ^(HThm)^"'/' - i(HThm) ( ln(n) n-2/3 > Q 



as soon as $n \ge L_{(\mathrm{HThm})}$. Therefore, $m \notin \widehat{\mathcal{M}}_{\mathrm{dim}}$, which concludes the proof of Theorem 2 with
$C_1 = 1 + \eta_A$. □
7.5 Proof of Proposition [2]

Let $L_{(\mathrm{HPro2})}$ denote a constant (varying from line to line) that may only depend on the
constants appearing in the assumptions of Proposition 2 (except the constant $K$).

According to ()19p in Proposition HI 

2 1 1 1 1 2 

E[p2(m)] = - Y <A> and E[p2(m)] < ^"^ H^Hoo ^ 1 y 

Now, using (Ar^) and (Ap), 



n ^ — ' n n n 

agA™, AeA„ 



2 C^. _ / 



AeA™ agA„ 



so that 



-Dm, / ,, ,,2 Ci 



E[p2(m)]<-^ + 



Therefore, for every m G Ain such that Dm > ln(n) , 

ci{K,n)E, [p2{fn) \ < pen(m) < C2E [p2("i)] 

with C2 = -ft^/Cmin ^^"^ 

= ^ ^— > -|^x 1- > FT^ (1 - (ln(n))-''+/2) > \ 

II ||2 I l-^b (7 1 J _b f / Z 



as soon as n > -^^(HPro2),x • Then, Theorem 5 in |2| shows that with probability at least 

(HPro2)'^ 



l-L— 



» ) — 



l + (^-2 



min<! > I I '"e^" 



inf {i{s,Sm)} 



1 1 2 

which concludes the proof. When K > 2\\a\\^, the leading constant of the oracle inequality is 
smaller than 

K 



+ (ln(n))~^/^ 



mm 



l-(ln(n))"^+/2 

which can be made as close as desired to $K \sigma_{\min}^{-2} - 1$ provided $n \ge L_{(\mathrm{HPro2}),K}$. □

7.6 Proof of Proposition [3]

Let $L_{(\mathrm{HPro3})}$ denote a constant (varying from line to line) that may only depend on the
constants appearing in the assumptions of Proposition 3 (except the constant $K$).
Since $\mathcal{M}_n = \mathcal{M}_n^{(\mathrm{reg},1/2)}$ and the penalty can be written $\mathrm{pen}(m) = \frac{K D_{m,1}}{n} + \frac{K D_{m,2}}{n}$, the model
selection problem can actually be split into two separate model selection problems: one for the $N$
data points for which $X_i \in [0, 1/2]$, the other for the $n - N$ data points for which $X_i \in (1/2, 1]$.

For proving Proposition 3, we can focus on the first problem only, that is, we are given $N$
independent data points with distribution $\mathcal{L}((X,Y) \mid X \in [0, 1/2])$, where $N$ is itself a random
variable whose distribution is binomial with parameters $n$ and $\mathbb{P}(X \in [0, 1/2]) = \mu$. The goal is
to select a model $m$ among the family $\mathcal{M}(n,N)$ of regular histograms on $[0, 1/2]$ with a number
of bins between 1 and n/(21n(n))^. Note that from Bernstein's inequality (see for instance 
Proposition 2.9 in [25]), we have with probability at least $1 - 2n^{-2}$ that

\[
n\mu + 2\sqrt{\mu n \ln(n)} + \frac{2\ln(n)}{3} \ge N \ge n\mu - 2\sqrt{\mu n \ln(n)} - \frac{2\ln(n)}{3} .
\]
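The two-sided bound on $N$ corresponds to Bernstein's inequality applied with deviation parameter $x = 2\ln(n)$, so that each tail has probability at most $e^{-x} = n^{-2}$. As a rough illustration, the sketch below computes the exact probability that a binomial variable with parameters $n$ and $\mu$ leaves the band $n\mu \pm (2\sqrt{\mu n \ln(n)} + 2\ln(n)/3)$ and compares it with $2n^{-2}$. The function names and the tested values of $n$ and $\mu$ are assumptions made for this example only.

```python
import math


def binomial_pmf(n, k, p):
    """P(N = k) for N ~ Binomial(n, p) with 0 < p < 1, computed in log scale."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def probability_outside_band(n, mu):
    """Exact probability that N falls outside n*mu +/- (2 sqrt(mu n ln n) + 2 ln(n) / 3)."""
    deviation = 2 * math.sqrt(mu * n * math.log(n)) + 2 * math.log(n) / 3
    lower, upper = n * mu - deviation, n * mu + deviation
    return sum(binomial_pmf(n, k, mu) for k in range(n + 1) if k < lower or k > upper)


# Hypothetical settings; Bernstein's inequality guarantees the bound 2 / n^2 in each case.
for n, mu in [(100, 0.5), (1000, 0.5), (1000, 0.1), (5000, 0.3)]:
    assert probability_outside_band(n, mu) <= 2 / n ** 2
```

In these settings the exact probability is far smaller than $2n^{-2}$, which is expected since Bernstein's inequality is not tight for the binomial distribution.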

In particular, on some event Qn of probability at least 1 — 2n~^ , 

N 



if n > , then 



- 1 



Now, on $\Omega_n$, we apply Theorem 2 in [9]. First, let us check that the assumptions of Theorem 2
in [9] are satisfied: (Ab) and (An) are assumed in Proposition 3; the upper bound on the bias
of the models holds because $s \in \mathcal{H}(\alpha, R)$; the uniform lower bound on $\mathbb{P}(X \in \lambda)$ holds because
$\mathbb{P}(X \in [0, 1/2]) = \mu > 0$ and $X$ has a lower bounded density w.r.t. $\mathrm{Leb}([0, 1/2])$. Finally, we
need an upper bound on $\mathrm{pen}(m) = K D_m / N$. Using the proof of Proposition 2, we have

~, pen(m) KDff,N K 
Vm G M(n N) — - — — - — < — = < 1 

^ ' ^' E[p2{m,N)]- NDff,mU^[o,i/2]Um infie[o,i/2] { ^(i)^ ' 

So, Theorem 2 in [9] shows that An,i ^ -^(HPro3),_ft'-^ (lii(-^) )^^ ^ -^(HPro3),_ft'^ (lii("') )~^ with 
probability at least 1 — -Z^(HPro3),_ft:-^~^ = 1 ~ -Z^(HPro3),_ft"^^^ • The lower bound (fT3]) on the risk 
also follows from Theorem 2 in [9] and its proof. □ 
Table 4: Accuracy indices $C_{\mathrm{or}}$ for each procedure in the four experiments ($\pm\, \varepsilon_{C_{\mathrm{or}},N}$). In each
column, the most accurate procedures (taking the uncertainty $\varepsilon_{C_{\mathrm{or}},N}$ into account) are bolded.

Experiment        X1-005           S0-1             XS1-05           X1-005μ02

Mal               8.961 ± 0.055    6.056 ± 0.033    3.424 ± 0.020    4.693 ± 0.019
Mal × 1.25        7.279 ± 0.057    5.086 ± 0.034    2.347 ± 0.015    4.692 ± 0.019
Mal × 2           4.101 ± 0.036    3.242 ± 0.022    1.452 ± 0.005    4.562 ± 0.021
Mal × 3           3.074 ± 0.019    2.675 ± 0.013    1.338 ± 0.003    4.033 ± 0.023
Mal × 4           2.862 ± 0.015    3.214 ± 0.017    1.891 ± 0.014    3.420 ± 0.022

Maloo             4.110 ± 0.036    3.242 ± 0.022    1.664 ± 0.008    3.122 ± 0.020
Maloo × 1.25      3.371 ± 0.024    2.794 ± 0.015    1.452 ± 0.005    3.334 ± 0.015
Maloo × 2         2.862 ± 0.015    3.214 ± 0.017    1.358 ± 0.004    3.370 ± 0.009
Maloo × 3         3.033 ± 0.019    5.035 ± 0.015    3.493 ± 0.019    3.430 ± 0.007
Maloo × 4         3.549 ± 0.025    5.810 ± 0.006    5.020 ± 0.009    3.493 ± 0.006

HO                3.598 ± 0.036    2.707 ± 0.021    1.848 ± 0.008    2.398 ± 0.018
2FCV              3.104 ± 0.032    2.458 ± 0.019    1.767 ± 0.007    2.289 ± 0.016
5FCV              3.176 ± 0.035    2.538 ± 0.021    1.749 ± 0.008    2.332 ± 0.018
10FCV             3.291 ± 0.037    2.559 ± 0.022    1.738 ± 0.008    2.369 ± 0.018

penHO             5.070 ± 0.045    3.492 ± 0.027    2.529 ± 0.014    2.798 ± 0.020
penHO × 1.25      4.393 ± 0.041    3.072 ± 0.024    2.152 ± 0.012    2.626 ± 0.019
penHO × 2         3.595 ± 0.034    2.659 ± 0.020    1.853 ± 0.008    2.751 ± 0.034
penHO × 3         3.516 ± 0.032    2.558 ± 0.018    1.972 ± 0.009    3.634 ± 0.055
penHO × 4         3.729 ± 0.033    2.573 ± 0.018    2.166 ± 0.011    4.663 ± 0.070

pen2F             4.530 ± 0.043    3.229 ± 0.025    2.325 ± 0.013    2.729 ± 0.019
pen2F × 1.25      3.649 ± 0.037    2.769 ± 0.022    1.945 ± 0.010    2.451 ± 0.018
pen2F × 2         2.619 ± 0.028    2.270 ± 0.017    1.619 ± 0.005    2.062 ± 0.014
pen2F × 3         2.273 ± 0.023    2.222 ± 0.015    1.539 ± 0.005    1.932 ± 0.013
pen2F × 4         2.275 ± 0.022    2.381 ± 0.016    1.586 ± 0.007    1.907 ± 0.014

pen5F             3.779 ± 0.041    2.857 ± 0.024    1.925 ± 0.010    2.540 ± 0.019
pen5F × 1.25      2.794 ± 0.031    2.331 ± 0.018    1.646 ± 0.007    2.193 ± 0.016
pen5F × 2         2.051 ± 0.019    1.995 ± 0.012    1.457 ± 0.004    1.880 ± 0.011
pen5F × 3         1.777 ± 0.013    2.119 ± 0.011    1.388 ± 0.003    1.860 ± 0.009
pen5F × 4         1.838 ± 0.015    2.384 ± 0.013    1.366 ± 0.003    1.887 ± 0.008

pen10F            3.599 ± 0.040    2.726 ± 0.024    1.810 ± 0.009    2.451 ± 0.019
pen10F × 1.25     2.726 ± 0.031    2.215 ± 0.018    1.594 ± 0.006    2.125 ± 0.016
pen10F × 2        1.893 ± 0.016    1.944 ± 0.012    1.451 ± 0.004    1.854 ± 0.010
pen10F × 3        1.709 ± 0.011    2.132 ± 0.011    1.358 ± 0.003    1.879 ± 0.008
pen10F × 4        (illegible)      (illegible)      (illegible)      (illegible)

penLoo            3.171 ± 0.034    2.499 ± 0.021    1.731 ± 0.008    2.395 ± 0.019
penLoo × 1.25     2.529 ± 0.027    2.118 ± 0.016    1.548 ± 0.006    2.065 ± 0.015
penLoo × 2        1.870 ± 0.014    1.954 ± 0.012    1.401 ± 0.003    1.879 ± 0.009
penLoo × 3        1.701 ± 0.010    2.183 ± 0.011    1.378 ± 0.003    1.931 ± 0.007
penLoo × 4        1.679 ± 0.010    2.157 ± 0.011    1.308 ± 0.002    2.002 ± 0.006

Epenid            2.805 ± 0.029    2.291 ± 0.019    1.702 ± 0.008    2.333 ± 0.019
Epenid × 1.25     2.304 ± 0.023    1.943 ± 0.014    1.513 ± 0.005    2.035 ± 0.015
Epenid × 2        1.780 ± 0.012    1.897 ± 0.011    1.371 ± 0.003    1.868 ± 0.009
Epenid × 3        1.687 ± 0.009    2.161 ± 0.011    1.312 ± 0.002    1.938 ± 0.007
Epenid × 4        1.646 ± 0.009    2.448 ± 0.010    1.299 ± 0.002    2.005 ± 0.006

Figure 11: Box plot of $\ell(s, \widehat{s}_{\widehat{m}})$ divided by the estimated value of $\mathbb{E}[\ell(s, \widehat{s}_{m^\star})]$ for algorithms Id*
(that is, penalties with the optimal data-driven overpenalization factor) in experiments X1-005
and X1-005μ02.

Figure 12: Box plot of $\ell(s, \widehat{s}_{\widehat{m}})$ divided by the estimated value of $\mathbb{E}[\ell(s, \widehat{s}_{m^\star})]$ for algorithms
Id* (that is, penalties with the optimal data-driven overpenalization factor) in experiments S0-1
and XS1-05.

[Figure panels for experiment X1-005μ02: left, $\widehat{m}^\star_{\mathrm{dim}}$; right, $m^\star$.]

Figure 13: Same as Figure 3 for the three other experiments. Left: $\log_{10} \mathbb{P}(m = \widehat{m}^\star_{\mathrm{dim}})$
represented in $\mathbb{R}^2$ using $(D_{m,1}, D_{m,2})$ as coordinates; $N = 10\,000$ samples have been
simulated for estimating the probabilities. Right: $\log_{10} \mathbb{P}(m = m^\star)$ using
the same representation and the same samples. The distributions of $m^\star$ and $\widehat{m}^\star_{\mathrm{dim}}$ are almost
disjoint.